
What is Agent Evaluation (Evals)?

Also called: evals, agent evaluation

Updated June 24, 2026

Quick Definition

Evaluation — often shortened to evals — is the practice of measuring how well an agent performs by scoring its runs against defined criteria. Rather than asking whether the code runs, evaluation asks whether the agent does its job: it grades outcomes such as task success, tool-call accuracy, and faithfulness so quality can be tracked and compared as the agent changes.

Why evaluation matters

An agent’s behavior is non-deterministic and emerges from a model, so you cannot establish its quality by reading the code. A change that looks harmless — a new prompt, a swapped model, an added tool — can quietly make the agent worse on cases you do not happen to try by hand. Spot-checking a few runs does not catch this, because the failures are spread across inputs you did not test.

Evaluation makes quality measurable and repeatable. By running the agent over a fixed set of cases and scoring the results, you get a number that can be compared before and after a change, so a regression shows up as a drop rather than as a surprise in production. It also turns vague goals into concrete targets: deciding what to measure forces you to define what good behavior actually means for this agent. That measured baseline is what lets a team iterate with confidence instead of guessing.
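As a rough illustration of comparing scores before and after a change, the sketch below flags any metric that drops by more than a tolerance. The function name find_regressions, the metric names, the scores, and the 0.02 tolerance are all hypothetical values chosen for the example, not figures from any real agent.

```python
def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return the metrics whose score dropped by more than `tolerance`.

    Both arguments map a metric name to an aggregate score in [0, 1].
    The tolerance is an assumed noise margin, not a recommended value.
    """
    drops = {}
    for metric, before in baseline.items():
        after = candidate.get(metric)
        # Skip metrics the candidate run did not report.
        if after is not None and before - after > tolerance:
            drops[metric] = round(before - after, 4)
    return drops


# Hypothetical aggregate scores from two versions of the same agent.
baseline = {"task_success": 0.90, "tool_call_accuracy": 0.85}
candidate = {"task_success": 0.82, "tool_call_accuracy": 0.86}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

Because agent runs are non-deterministic, a small tolerance keeps ordinary run-to-run noise from being reported as a regression; the right margin depends on the size of the dataset.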

How it works

Evaluation runs the agent against known cases and scores the outcomes:

  1. Assemble a dataset of representative inputs, ideally with expected outcomes or reference answers where they exist.
  2. Run the agent over each case, capturing not just the final output but the steps it took to get there.
  3. Score each run against the chosen metrics — task success, tool-call accuracy, faithfulness to retrieved evidence, latency, or others suited to the task.
  4. Aggregate the scores into measures you can track over time and compare across versions.
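The four steps above can be sketched as a small loop. This is a minimal illustration, not a reference harness: Case, Run, toy_agent, and task_success are hypothetical names invented for the example, and the toy agent simply upper-cases its input so the code runs without a model.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class Case:
    """One dataset entry: an input plus the expected outcome (step 1)."""
    input: str
    expected: str


@dataclass
class Run:
    """What a run produced: the final output and the steps taken (step 2)."""
    output: str
    steps: list[str] = field(default_factory=list)


def toy_agent(prompt: str) -> Run:
    """Hypothetical stand-in for a real agent: upper-cases its input."""
    return Run(output=prompt.upper(), steps=["call:uppercase"])


def task_success(case: Case, run: Run) -> float:
    """Score 1.0 when the output matches the expected outcome (step 3)."""
    return 1.0 if run.output == case.expected else 0.0


def evaluate(agent: Callable[[str], Run],
             cases: list[Case],
             metrics: dict[str, Callable[[Case, Run], float]]) -> dict[str, float]:
    """Run every case, score it on every metric, and average (step 4)."""
    per_metric: dict[str, list[float]] = {name: [] for name in metrics}
    for case in cases:
        run = agent(case.input)
        for name, score in metrics.items():
            per_metric[name].append(score(case, run))
    return {name: mean(scores) for name, scores in per_metric.items()}


cases = [Case("hello", "HELLO"), Case("agent", "AGENT"), Case("Eval", "eval")]
report = evaluate(toy_agent, cases, {"task_success": task_success})
print(report)
```

Keeping metrics as plain functions of a case and a run means a new metric can be added without touching the loop, and the same report shape can be compared across agent versions.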

Scoring uses whatever fits the metric: exact matching or assertions for outputs with a clear right answer, and an LLM-as-judge — a model scoring outputs against a rubric — for open-ended responses that no single string can capture.
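The two scoring styles might look like the sketch below. The rubric text, the 1 to 5 scale, and the judge callable are assumptions made for illustration; in practice the judge would wrap a call to whichever model provider you use, and fake_judge here is only a stub so the example runs offline.

```python
from typing import Callable

# Hypothetical rubric; a real one would be tailored to the task.
RUBRIC = (
    "Rate from 1 to 5 how well the answer addresses the question. "
    "Reply with the number only."
)


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer for outputs with a clear right answer."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def judge_score(question: str, answer: str,
                judge: Callable[[str], str]) -> float:
    """Ask a judge model to rate an answer and normalize the rating to [0, 1].

    `judge` is any function that takes a prompt and returns the model's
    text reply; the prompt layout is an assumption of this sketch.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    rating = int(judge(prompt).strip())
    if not 1 <= rating <= 5:
        raise ValueError(f"judge returned an out-of-range rating: {rating}")
    return (rating - 1) / 4


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real model call; always answers 4."""
    return "4"


print(exact_match(" 42\n", "42"))
print(judge_score("What is the refund window?", "30 days.", fake_judge))
```

Passing the judge in as a function keeps the scorer testable with a stub and makes it easy to swap judge models later.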

Evaluation vs. observability

Evaluation and observability are complementary. Observability shows what happened on a given run — the steps, the tool calls, the timings — and is essential for diagnosing a specific failure. Evaluation judges how good behavior is against criteria, typically aggregated across many runs, and is what tells you whether the agent is improving. Observability explains an individual case; evaluation measures the population.

In practice

A durable, observable runtime captures the full trace of every run, which gives evaluation more to score than the final answer alone: the recorded steps let you grade tool-call accuracy and the reasoning loop, not just the output. The traces recorded for observability double as evaluation inputs, and evaluation findings show where guardrails are needed. For more on testing and evaluating agents, see the testing page.

Frequently asked questions

What is the difference between evaluation and observability?

Observability shows you what an agent did on a given run. Evaluation judges how good that behavior was against defined criteria, usually across many runs. Observability describes; evaluation scores.

What should you measure when evaluating an agent?

Common metrics include task success (did the run achieve its goal), tool-call accuracy (were the right tools called with the right arguments), and faithfulness (is the output grounded in the evidence the agent retrieved). The mix depends on what the agent is for.
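One simple way to score tool-call accuracy is to compare the calls recorded in a run against the calls you expected, position by position. The sketch below assumes each call is recorded as a dict with "tool" and "args" keys; that shape, the tool names, and the strict in-order matching rule are choices made for this example, and other definitions (order-insensitive, partial credit for argument matches) are equally reasonable.

```python
def tool_call_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Fraction of positions where the tool name and arguments both match.

    Dividing by the longer of the two lists penalizes both missing
    and extra calls.
    """
    if not expected and not actual:
        return 1.0  # nothing expected, nothing called
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / max(len(expected), len(actual))


# Hypothetical expected calls and the calls captured from a run.
expected_calls = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "send_email", "args": {"to": "a@example.com"}},
]
actual_calls = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "send_email", "args": {"to": "b@example.com"}},  # wrong argument
]

print(tool_call_accuracy(expected_calls, actual_calls))
```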

What is LLM-as-judge?

LLM-as-judge uses a language model to score an agent's output against a rubric, for example by rating relevance or correctness. It scales to open-ended outputs that have no single right answer, though its judgments must themselves be validated against human ratings.
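Validating a judge against human ratings can start with two numbers: the raw agreement rate and Cohen's kappa, which corrects agreement for chance. The sketch below is one possible check; the pass/fail labels are made-up sample data, not results from a real judge.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Share of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight outputs rated by the judge and by a human.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```

Raw agreement can look high simply because one label dominates, which is why a chance-corrected measure is worth reporting alongside it.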
