What is Durable Execution?

Also called: durable agents, durable workflow execution

Updated June 24, 2026

Quick Definition

Durable execution is an execution model in which an agent’s state — every step it has taken, every tool result, and every pending approval — is persisted outside the process that is running it. If that process crashes, is restarted, or is redeployed, the run resumes from the last completed step instead of starting over.

Why durable execution matters

Most agent frameworks run the reasoning loop inside a single process and keep the entire run in memory: the conversation so far, which tools have been called, and what they returned. That is fine in a notebook, but in production it is fragile. If the process dies — an out-of-memory kill, a deploy, a spot-instance reclaim, a crashed dependency — the run is gone. There is no way to pick it back up, so the only option is to start again from the first token.

That failure mode is expensive in three ways. Work already done (LLM calls, tool calls, payments, emails) is repeated or lost. Any side effects that already happened may run a second time. And anything waiting on a human — an approval that has been sitting for an hour — simply disappears when the process restarts.

How it works

Durable execution separates defining an agent from running it. The definition (the model, the tools, the control flow) is compiled into a workflow whose state lives in durable storage rather than in process memory. Each step is recorded as it completes:

  1. The agent decides on the next action (a tool call, a sub-agent handoff, or a final answer).
  2. The action runs, and its result is checkpointed to durable storage.
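  3. If the worker executing a step dies, another worker reads the recorded state and continues from the first step that has no recorded result. Completed steps are not re-run; a step that was interrupted before its result was checkpointed runs again, so steps with side effects should be idempotent.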
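The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any particular runtime: `run_durably`, `CrashError`, the `crash_before` parameter, and the JSON journal file are all hypothetical names, and a local file stands in for durable storage.

```python
import json
import os
import tempfile


class CrashError(RuntimeError):
    """Stands in for the worker process dying mid-run."""


def run_durably(steps, journal_path, crash_before=None):
    """Run (name, fn) steps in order, checkpointing each result to a journal.

    Each fn receives the dict of results recorded so far. crash_before names
    a step at which to simulate a crash (for demonstration only).
    """
    # Load whatever an earlier worker managed to record.
    results = {}
    if os.path.exists(journal_path):
        with open(journal_path) as f:
            results = json.load(f)

    for name, fn in steps:
        if name in results:
            continue  # completed earlier: skip instead of re-running
        if name == crash_before:
            raise CrashError(f"worker died before {name}")
        results[name] = fn(results)
        # Checkpoint: write a temp file, then atomically replace the journal.
        tmp_path = journal_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, journal_path)
    return results


calls = []  # records which steps actually executed


def search(_results):
    calls.append("search")
    return "3 documents"


def summarize(results):
    calls.append("summarize")
    return f"summary of {results['search']}"


steps = [("search", search), ("summarize", summarize)]
journal = os.path.join(tempfile.mkdtemp(), "run-1.json")

try:
    run_durably(steps, journal, crash_before="summarize")  # first worker dies
except CrashError:
    pass

final = run_durably(steps, journal)  # a second worker resumes from the journal
```

In this sketch the second call skips `search` because its result is already in the journal, so each step executes once across both workers.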

Because progress is recorded outside the process, the same run can be observed, paused, resumed, or cancelled from a different machine entirely. This is the same execution model that durable workflow engines have used for years to run mission-critical business processes; applying it to the agent loop is what makes an agent reliable enough to run unattended.
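To make the point about controlling a run from elsewhere concrete, here is a small sketch in which two separate database connections play the roles of a worker and an operator on another machine. The `runs` table, its columns, and the `connect`, `set_status`, and `advance` helpers are assumptions made for illustration; a local SQLite file stands in for shared durable storage.

```python
import os
import sqlite3
import tempfile


def connect(path):
    """Open a connection and make sure the (hypothetical) runs table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "run_id TEXT PRIMARY KEY, status TEXT NOT NULL, next_step INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def set_status(conn, run_id, status):
    """Operator side: pause, resume, or cancel by updating the stored status."""
    conn.execute("UPDATE runs SET status = ? WHERE run_id = ?", (status, run_id))
    conn.commit()


def advance(conn, run_id):
    """Worker side: take one step only if the stored status is 'running'."""
    row = conn.execute(
        "SELECT status, next_step FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None or row[0] != "running":
        return False
    conn.execute(
        "UPDATE runs SET next_step = ? WHERE run_id = ?", (row[1] + 1, run_id)
    )
    conn.commit()
    return True


db_path = os.path.join(tempfile.mkdtemp(), "runs.db")
worker = connect(db_path)    # the process executing the run
operator = connect(db_path)  # a separate connection, e.g. a dashboard

worker.execute("INSERT INTO runs VALUES ('run-1', 'running', 0)")
worker.commit()

first = advance(worker, "run-1")         # the run is active, so a step runs
set_status(operator, "run-1", "paused")  # paused from the other connection
second = advance(worker, "run-1")        # the worker sees the pause and stops
set_status(operator, "run-1", "running")
third = advance(worker, "run-1")         # resumed: the next step runs

next_step = operator.execute(
    "SELECT next_step FROM runs WHERE run_id = 'run-1'"
).fetchone()[0]

worker.close()
operator.close()
```

Because both sides read and write the same stored record, the worker needs no in-memory signal from the operator; it checks the recorded status before each step.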

Durable execution vs. caching

Caching and memoization can avoid repeating an expensive call, but they do not preserve the position of a run — which branch it was on, what it was waiting for, or what it had already decided. Durable execution preserves the full control-flow state, which is what allows a run to resume mid-flight rather than merely skip a recomputation.
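The difference can be shown side by side. In the sketch below, `lookup` is a memoized function and `RunState` is a hypothetical record of a run's position; the field names (`next_step`, `branch`, `waiting_for`) are assumptions chosen for illustration, and a JSON string stands in for durable storage.

```python
import json
from dataclasses import asdict, dataclass, field
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=None)
def lookup(query: str) -> str:
    """Memoized call: a repeat with the same argument is served from memory."""
    return f"result for {query}"


@dataclass
class RunState:
    """Where a run is, not just what it has computed (illustrative fields)."""
    next_step: int = 0
    branch: str = "default"
    waiting_for: Optional[str] = None
    results: dict = field(default_factory=dict)


policy = lookup("refund policy")
lookup("refund policy")           # second call is a cache hit
hits = lookup.cache_info().hits   # the cache only knows about repeated calls

state = RunState(
    next_step=2,
    branch="refund_over_limit",
    waiting_for="manager_approval",
    results={"lookup": policy},
)
saved = json.dumps(asdict(state))  # persisted outside the process

lookup.cache_clear()               # simulate a restart: in-process cache is gone
restored = RunState(**json.loads(saved))
```

After the simulated restart the cache is empty, but the restored state still records which branch the run was on and what it was waiting for.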

In practice

A durable runtime persists each step of an agent server-side and reassigns work to a healthy worker when one fails, so a crash becomes a pause rather than a restart. This is the foundation that crash recovery, human-in-the-loop approvals, and queryable observability are built on. For a deeper walkthrough, see "Why durable agents" and the crash-and-resume example.

Frequently asked questions

What is the difference between durable execution and retries?

Retries re-run a failed step. Durable execution persists the outcome of every completed step, so after a failure the run continues from where it stopped rather than repeating work that already succeeded.
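The two mechanisms are complementary, as the following sketch suggests. Here `run_with_retries` is a hypothetical helper, the step functions are invented for the example, and a plain dict stands in for durable storage that would outlive the process.

```python
def run_with_retries(steps, completed, max_attempts=3):
    """Retry each failing step in place; record each success in `completed`."""
    for name, fn in steps:
        if name in completed:
            continue  # durable part: a recorded step is never repeated
        for attempt in range(1, max_attempts + 1):
            try:
                completed[name] = fn()
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise  # retries exhausted: the run stops here


attempts = {"fetch": 0, "charge": 0}


def fetch():
    attempts["fetch"] += 1
    return "order-42"


def charge():
    attempts["charge"] += 1
    if attempts["charge"] < 3:
        raise ConnectionError("transient failure")
    return "charged"


steps = [("fetch", fetch), ("charge", charge)]
store = {}  # stand-in for durable storage

try:
    run_with_retries(steps, store, max_attempts=2)  # charge fails twice
except ConnectionError:
    pass  # the run stops, but the result of fetch is already recorded

run_with_retries(steps, store, max_attempts=2)      # resumes at charge
```

The retry loop handles the flaky `charge` step, while the recorded result of `fetch` is what keeps the second invocation from repeating work that already succeeded.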

Does durable execution make agents slower?

It adds a small amount of latency to persist state at each step, but that cost is usually negligible next to LLM and tool-call latency — and it removes the much larger cost of re-running an entire agent from scratch after a crash.

Is durable execution only useful for long-running agents?

It is most valuable for runs that are long, expensive, or that pause for human approval, but any agent that touches external systems benefits from not repeating side effects after a failure.
