What is Crash Recovery (Agent Resumption)?

Also called: agent resumption, resume

Updated June 24, 2026

Quick definition

Crash recovery is the ability of an agent to continue a run after the process or machine executing it has died. Instead of starting over from the first step, a healthy worker reads the run’s persisted state and resumes immediately after the last completed step. It is the user-visible payoff of durable execution: a crash becomes a pause rather than a total loss.

Why crash recovery matters

Agents fail in the middle of work more often than people expect. A long research run is killed by an out-of-memory limit, a deploy restarts every worker, a spot instance is reclaimed, or a dependency crashes. If the entire run lived in that process’s memory, there is nothing left to recover and the only option is to begin again from scratch.

That restart is costly in three distinct ways. Expensive work already done — model calls, tool calls, retrieved documents — is paid for a second time. Side effects that already happened, such as a payment or an email, may run again because the system does not know they completed. And anything that was waiting on a person, like an approval sitting for an hour, disappears. Crash recovery removes all three costs by making the run’s progress outlive the process.

How it works

Crash recovery is a consequence of how state is stored, not a feature bolted on afterward:

  1. As each step of the agent completes, its result is checkpointed to durable storage rather than kept only in process memory.
  2. When a worker dies, the orchestration layer detects that the step it was running did not finish.
  3. The work is reassigned to a healthy worker, which loads the recorded state for the run.
  4. Execution continues from the first step that had not completed; every step that already finished is skipped, so its result and side effects are not produced again.
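The four steps above can be sketched in a few lines. This is a minimal illustration rather than any particular runtime's implementation: the JSON file stands in for durable storage, and `run`, `load_state`, `save_state`, `SimulatedCrash`, and the two step functions are hypothetical names invented for the example.

```python
import json
import os
import tempfile

class SimulatedCrash(Exception):
    """Stands in for the worker process dying mid-run (assumption for the demo)."""

def load_state(path):
    """Read the run's persisted step results, or start empty for a new run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_state(path, state):
    """Write to a temp file, then swap it in, so a crash mid-write leaves the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def run(steps, path, crash_before=None):
    """Execute steps in order, skipping any whose result is already recorded."""
    state = load_state(path)
    for name, fn in steps:
        if name in state:
            continue  # finished in an earlier attempt; reuse the recorded result
        if name == crash_before:
            raise SimulatedCrash(name)
        state[name] = fn(state)
        save_state(path, state)  # checkpoint before moving to the next step
    return state

calls = []  # records which step bodies actually executed

def search(state):
    calls.append("search")
    return ["doc-a", "doc-b"]

def summarize(state):
    calls.append("summarize")
    return f"{len(state['search'])} documents summarized"

steps = [("search", search), ("summarize", summarize)]
path = os.path.join(tempfile.mkdtemp(), "run-1.json")

try:
    run(steps, path, crash_before="summarize")  # first worker dies after "search"
except SimulatedCrash:
    pass

final = run(steps, path)  # a second worker resumes from the checkpoint
print(calls)               # ['search', 'summarize']
print(final["summarize"])  # 2 documents summarized
```

The `search` step runs once even though `run` is called twice, because the second call finds its result in the checkpoint and skips straight to `summarize`.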

Because the run’s position lives outside any single process, recovery can happen on a different machine entirely, and a redeploy that replaces every worker does not lose in-flight runs.
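One common way for an orchestration layer to notice a dead worker and hand the run to another machine is a lease that the owner must keep renewing. The sketch below assumes that approach; the text above does not say which detection mechanism is used. `RunTable`, `claim`, `heartbeat`, and the 30-second timeout are hypothetical, and the in-memory dictionary stands in for a shared durable table.

```python
from dataclasses import dataclass

LEASE_SECONDS = 30.0  # assumed timeout; a real system would tune this

@dataclass
class Lease:
    owner: str
    expires_at: float

class RunTable:
    """In-memory stand-in for a shared, durable table of run ownership."""

    def __init__(self):
        self.leases = {}  # run_id -> Lease

    def claim(self, run_id, worker, now):
        """Take the run if it is unowned or its owner's lease has expired."""
        lease = self.leases.get(run_id)
        if lease is not None and lease.owner != worker and lease.expires_at > now:
            return False  # another worker still holds a live lease
        self.leases[run_id] = Lease(worker, now + LEASE_SECONDS)
        return True

    def heartbeat(self, run_id, worker, now):
        """Extend the lease; fails if the run has been reassigned."""
        lease = self.leases.get(run_id)
        if lease is None or lease.owner != worker:
            return False
        lease.expires_at = now + LEASE_SECONDS
        return True

table = RunTable()
a_claimed = table.claim("run-1", "worker-a", now=0.0)     # worker-a owns the run
b_blocked = table.claim("run-1", "worker-b", now=10.0)    # refused: lease is live
a_alive = table.heartbeat("run-1", "worker-a", now=20.0)  # lease now expires at 50.0
b_early = table.claim("run-1", "worker-b", now=45.0)      # still refused
# worker-a dies here and sends no further heartbeats
b_takeover = table.claim("run-1", "worker-b", now=51.0)   # lease expired: reassigned
a_stale = table.heartbeat("run-1", "worker-a", now=52.0)  # old owner is rejected
```

Time is passed in explicitly so the behaviour is easy to follow; a real worker would read a clock. The rejected final heartbeat matters: if the original worker was only slow rather than dead, it learns that it no longer owns the run.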

Crash recovery vs. retries

A retry re-runs a single step that failed, which is useful for transient errors but assumes the surrounding run is still alive in memory. Crash recovery handles the case where the whole process is gone: it restores the run’s full position from durable state and continues, rather than re-executing one operation. Retries and crash recovery often work together — retries handle a flaky tool call, while crash recovery handles a dead worker.
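The division of labour can be shown with a small sketch. It assumes a retry wrapper with exponential backoff and a dictionary standing in for durable storage; `with_retries`, `run_step`, `TransientError`, and `flaky_tool` are hypothetical names used only for illustration.

```python
import time

class TransientError(Exception):
    """A failure worth retrying, such as a timeout from a flaky tool."""

def with_retries(fn, attempts=3, base_delay=0.0):
    """Re-run one step in the same process until it succeeds or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

def run_step(store, name, fn):
    """Return the recorded result if the step finished; otherwise run and record it."""
    if name not in store:
        store[name] = with_retries(fn)
    return store[name]

counter = {"calls": 0}  # counts how many times the tool was really invoked

def flaky_tool():
    counter["calls"] += 1
    if counter["calls"] < 3:
        raise TransientError("timeout")
    return "tool result"

store = {}  # stand-in for durable storage that outlives the worker
first = run_step(store, "lookup", flaky_tool)   # retries absorb two timeouts
second = run_step(store, "lookup", flaky_tool)  # as if resumed by a new worker
print(counter["calls"])  # 3
```

The attempt counter inside `with_retries` exists only in the running process, which is why retries cannot help once that process is gone. The recorded result in `store` is what lets the second call return without touching the tool again.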

In practice

A durable, observable runtime makes crash recovery automatic by persisting each step server-side and reassigning work to a healthy worker when one fails. This rests directly on durable execution and the state management that records each step, and it is what lets a human-in-the-loop pause survive a restart. For a hands-on walkthrough, see the crash-and-resume example.
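To illustrate how a human-in-the-loop pause can survive a restart, the sketch below persists the pause itself as state instead of holding it in a blocked process. It is an assumption-laden illustration, not a description of any specific runtime: the JSON file stands in for server-side storage, and `request_approval`, `record_decision`, `resume`, and the status strings are hypothetical.

```python
import json
import os
import tempfile

def load(path):
    with open(path) as f:
        return json.load(f)

def save(path, record):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
    os.replace(tmp, path)  # atomic swap so readers never see a partial file

def request_approval(path, action):
    """Record that the run is paused on a person; the worker can then exit."""
    save(path, {"status": "waiting_for_approval", "action": action, "decision": None})

def record_decision(path, approved):
    """Called when the person responds, possibly long after the first worker is gone."""
    record = load(path)
    record["decision"] = "approved" if approved else "rejected"
    record["status"] = "ready"
    save(path, record)

def resume(path):
    """Any worker can pick the run up; it reads the pause state instead of remembering it."""
    record = load(path)
    if record["status"] == "waiting_for_approval":
        return "still waiting"
    if record["decision"] == "approved":
        return f"executing: {record['action']}"
    return f"cancelled: {record['action']}"

path = os.path.join(tempfile.mkdtemp(), "run-7.json")
request_approval(path, "send refund")
before = resume(path)  # a restarted worker finds the pause intact
record_decision(path, approved=True)
after = resume(path)   # the run continues once the decision is recorded
print(before)  # still waiting
print(after)   # executing: send refund
```

Because no process has to stay alive while the approval is pending, a deploy or crash during the wait changes nothing: the next worker to load the record sees the same pending request.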

Frequently asked questions

How is crash recovery possible at all?

It depends on durable state. Because the outcome of every completed step is persisted outside the process running the agent, a new worker can read that state after a failure and continue from where the run stopped.

Are already-completed steps re-run when an agent resumes?

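No. Steps that finished have their results recorded, so resumption picks up at the first step that had not completed. Completed work, including its side effects, is not repeated. The step that was in progress when the worker died has no recorded result, so it runs again from its start; any side effect inside that step should therefore be safe to repeat.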

What kinds of failures does crash recovery cover?

It covers failures that take down the process or machine without corrupting persisted state: out-of-memory kills, deploys and rolling restarts, spot-instance reclaims, crashed dependencies, and hardware loss. The run survives because its state does not live only in that process.
