What is Agent Observability?

Also called: agent observability, LLM observability

Updated June 24, 2026

Quick Definition

Observability is the ability to understand what an agent did and why from data the system records as it runs — its steps, tool calls, inputs and outputs, intermediate decisions, and token usage. Where a final answer shows only the result, observability exposes the path that produced it, so a run can be inspected, explained, and debugged after the fact.

Why observability matters

An agent’s behavior is decided at runtime, not written out in advance, so two runs of the same agent on similar inputs can take different paths. When one produces a wrong or surprising result, the final output rarely explains it. The useful question is what the agent did along the way: which tools it called, what those calls returned, what the model decided at each step, and where things diverged from expectation.

Without that record, debugging an agent means re-running it and hoping the failure reproduces, which it often will not. Observability replaces guesswork with evidence. It is also what makes an agent operable by more than its author — support, on-call, and reviewers can all reconstruct a run instead of relying on the one person who remembers how the code is wired.

How it works

Observability comes from capturing the run as structured events rather than scattered log lines. A well-instrumented agent records, for each run:

  1. Every step the agent took, in order, with timing.
  2. Each tool call — its name, inputs, and the output or error it returned.
  3. The model’s decisions at each turn, including which action it chose and, where available, why.
  4. Token usage and cost per step, so expensive paths are visible.
  5. Errors, retries, and human approvals, so failures and pauses are part of the same record.
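
The list above can be sketched as a single event schema. This is a minimal illustration, not a prescribed format: the `RunEvent` fields, the `RunRecorder` class, and the tool name `search_orders` are all hypothetical names chosen for the example.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class RunEvent:
    """One structured record in a run's trace (hypothetical schema)."""
    run_id: str
    seq: int                          # order of the step within the run
    kind: str                         # e.g. "model_decision", "tool_call", "retry", "approval"
    name: str                         # tool name or chosen action
    inputs: dict[str, Any] = field(default_factory=dict)
    output: Any = None
    error: Optional[str] = None
    rationale: Optional[str] = None   # why the action was chosen, when available
    tokens_in: int = 0
    tokens_out: int = 0
    started_at: float = field(default_factory=time.time)
    duration_s: float = 0.0


class RunRecorder:
    """Collects events for one run in order (in memory only, for illustration)."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: list[RunEvent] = []

    def record(self, kind: str, name: str, **fields: Any) -> RunEvent:
        # seq is assigned here so ordering never depends on the caller.
        event = RunEvent(run_id=self.run_id, seq=len(self.events),
                         kind=kind, name=name, **fields)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, so the trace can be queried later.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


rec = RunRecorder("run-001")
rec.record("model_decision", "search_orders",
           rationale="user asked about an order", tokens_in=412, tokens_out=38)
rec.record("tool_call", "search_orders",
           inputs={"order_id": "A17"}, error="timeout", duration_s=5.0)
rec.record("retry", "search_orders",
           inputs={"order_id": "A17"}, output={"status": "shipped"}, duration_s=0.4)
print(rec.to_jsonl())
```

Keeping the failed call and its retry as separate events in the same sequence is what lets a reader see that the run paused and recovered, not only that it eventually succeeded.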

These events can be streamed live to watch a run as it happens, or queried afterward to reconstruct one that already finished. When the run’s state is durable, the trace is complete even for runs that crashed and resumed, because the record does not live only in the process that was executing.
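
As a rough sketch of why a durable record survives a crash, the example below appends events to a file instead of holding them in memory. The `EventLog` class and the JSONL file are assumptions standing in for whatever durable store a real runtime uses; the event fields are hypothetical.

```python
import json
import os
import tempfile
from typing import Iterator


class EventLog:
    """Append-only JSONL log on disk (hypothetical stand-in for a durable store)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, event: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the write to disk before moving on

    def read(self, run_id: str, after_seq: int = -1) -> Iterator[dict]:
        """Yield events for one run, optionally only those after a given seq."""
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                if event["run_id"] == run_id and event["seq"] > after_seq:
                    yield event


path = os.path.join(tempfile.mkdtemp(), "events.jsonl")

# First process: records two steps, then goes away (simulated crash).
log = EventLog(path)
log.append({"run_id": "run-001", "seq": 0, "kind": "tool_call", "name": "fetch_invoice"})
log.append({"run_id": "run-001", "seq": 1, "kind": "error", "name": "fetch_invoice"})
del log

# Resumed process: continues numbering from what is already on disk.
resumed = EventLog(path)
last_seq = max(e["seq"] for e in resumed.read("run-001"))
resumed.append({"run_id": "run-001", "seq": last_seq + 1,
                "kind": "retry", "name": "fetch_invoice"})

trace = list(resumed.read("run-001"))              # query afterward: the whole run
tail = list(resumed.read("run-001", after_seq=1))  # live-style poll: only new events
```

The same `read` call serves both uses: a live viewer polls with the last sequence number it has seen, while a later investigation reads from the start.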

Observability vs. monitoring

Monitoring and observability are related but answer different questions. Monitoring watches predefined metrics — error rate, latency, throughput — and alerts when one crosses a threshold; it tells you that something is wrong. Observability lets you ask questions you did not anticipate, reconstructing why a particular run behaved as it did from its recorded detail. Monitoring is about aggregate health; observability is about explaining individual behavior, and an agent that makes runtime decisions needs the latter to be debuggable.
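
The contrast can be illustrated over the same recorded events. In this sketch the event fields, the tool names, and the 25% alert threshold are all assumptions made for the example.

```python
# Recorded events from two runs (hypothetical data).
events = [
    {"run_id": "run-1", "seq": 0, "kind": "tool_call", "name": "lookup_customer", "error": None},
    {"run_id": "run-1", "seq": 1, "kind": "tool_call", "name": "issue_refund", "error": None},
    {"run_id": "run-2", "seq": 0, "kind": "tool_call", "name": "lookup_customer", "error": "timeout"},
    {"run_id": "run-2", "seq": 1, "kind": "model_decision", "name": "answer_without_data", "error": None},
]


def error_rate(events: list[dict]) -> float:
    """Monitoring: one predefined aggregate across all runs."""
    calls = [e for e in events if e["kind"] == "tool_call"]
    if not calls:
        return 0.0
    return sum(1 for e in calls if e["error"]) / len(calls)


def should_alert(events: list[dict], threshold: float = 0.25) -> bool:
    """Monitoring: says that something is wrong, not why."""
    return error_rate(events) > threshold


def explain_run(events: list[dict], run_id: str) -> list[str]:
    """Observability: reconstruct the path one specific run took."""
    steps = sorted((e for e in events if e["run_id"] == run_id),
                   key=lambda e: e["seq"])
    lines = []
    for e in steps:
        line = f'{e["seq"]}: {e["kind"]} {e["name"]}'
        if e["error"]:
            line += f' -> error: {e["error"]}'
        lines.append(line)
    return lines


print(should_alert(events))
for line in explain_run(events, "run-2"):
    print(line)
```

The alert fires on the aggregate, but only the per-run view shows that `run-2` went on to answer after its lookup timed out.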

In practice

A durable, observable runtime records each step of a run server-side and can stream those events live or expose them for later inspection, so a finished run can be explained rather than re-run. This pairs naturally with tracing and spans for following a single request, and it supplies the traces that evaluation scores. The record stays complete across crashes and resumes because durable execution keeps it outside the executing process. To watch a run as it happens, see streaming and live events.

Frequently asked questions

What is the difference between observability and monitoring?

Monitoring tracks known metrics and alerts when they cross a threshold, answering whether the system is healthy. Observability lets you ask new questions after the fact — reconstructing why a specific run behaved as it did from its recorded steps and decisions, not just whether an aggregate looked normal.

What should you capture to make an agent observable?

At minimum: each step the agent took, every tool call with its inputs and outputs, the model's intermediate decisions, token usage, errors and retries, and any human approvals. Together these let you reconstruct a run's reasoning instead of guessing at it from the final answer.

How is observability different from evaluation?

Observability records what happened in a run so it can be inspected. Evaluation judges whether what happened was correct or good, often across many runs. Observability supplies the detailed traces that evaluation then scores.
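
A small sketch of that hand-off, assuming traces are already recorded: the scoring function below reads traces and applies a rubric to each. The rubric itself (look up the order before refunding, stay under 2000 tokens), the tool names, and the trace fields are hypothetical.

```python
def score_run(trace: list[dict]) -> dict[str, bool]:
    """Evaluation: judge one recorded run against a hypothetical rubric."""
    tools = [e["name"] for e in trace if e["kind"] == "tool_call"]
    # Pass if no refund was issued, or if the order was looked up first.
    verified_first = "issue_refund" not in tools or (
        "lookup_order" in tools
        and tools.index("lookup_order") < tools.index("issue_refund")
    )
    total_tokens = sum(e.get("tokens", 0) for e in trace)
    return {
        "verified_before_refund": verified_first,
        "within_token_budget": total_tokens <= 2000,
    }


# Traces as observability would supply them (hypothetical data).
traces = {
    "run-1": [
        {"kind": "tool_call", "name": "lookup_order", "tokens": 300},
        {"kind": "tool_call", "name": "issue_refund", "tokens": 250},
    ],
    "run-2": [
        {"kind": "tool_call", "name": "issue_refund", "tokens": 2400},
    ],
}

scores = {run_id: score_run(trace) for run_id, trace in traces.items()}
pass_rate = sum(all(s.values()) for s in scores.values()) / len(scores)
print(scores, pass_rate)
```

Neither check could be computed from the final answers alone; both depend on the recorded steps.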
