Crash recovery is the ability of an agent to continue a run after the process or machine executing it has died. Instead of starting over from the first step, a healthy worker reads the run’s persisted state and resumes from the last completed step. It is the user-visible payoff of durable execution: a crash becomes a pause rather than a total loss.
Why crash recovery matters
Agents fail in the middle of work more often than people expect. A long research run is killed by an out-of-memory limit, a deploy restarts every worker, a spot instance is reclaimed, or a dependency crashes. If the entire run lived in that process’s memory, there is nothing left to recover and the only option is to begin again from scratch.
That restart is costly in three distinct ways. Expensive work already done — model calls, tool calls, retrieved documents — is paid for a second time. Side effects that already happened, such as a payment or an email, may run again because the system does not know they completed. And anything that was waiting on a person, like an approval sitting for an hour, disappears. Crash recovery removes all three costs by making the run’s progress outlive the process.
How it works
Crash recovery is a consequence of how state is stored, not a feature bolted on afterward:
- As each step of the agent completes, its result is checkpointed to durable storage rather than kept only in process memory.
- When a worker dies, the orchestration layer detects that the step it was running did not finish.
- The work is reassigned to a healthy worker, which loads the recorded state for the run.
- Execution continues from the first step that had not completed; every step that already finished is skipped, so its result and side effects are not produced again.
Because the run’s position lives outside any single process, recovery can happen on a different machine entirely, and a redeploy that replaces every worker does not lose in-flight runs.
Crash recovery vs. retries
A retry re-runs a single step that failed, which is useful for transient errors but assumes the surrounding run is still alive in memory. Crash recovery handles the case where the whole process is gone: it restores the run’s full position from durable state and continues, rather than re-executing one operation. Retries and crash recovery often work together — retries handle a flaky tool call, while crash recovery handles a dead worker.
In practice
A durable, observable runtime makes crash recovery automatic by persisting each step server-side and reassigning work to a healthy worker when one fails. This rests directly on durable execution and the state management that records each step, and it is what lets a human-in-the-loop pause survive a restart. For a hands-on walkthrough, see the crash-and-resume example.