Evaluation — often shortened to evals — is the practice of measuring how well an agent performs by scoring its runs against defined criteria. Rather than asking whether the code runs, evaluation asks whether the agent does its job: it grades outcomes such as task success, tool-call accuracy, and faithfulness so quality can be tracked and compared as the agent changes.
Why evaluation matters
An agent’s behavior is non-deterministic and emerges from a model, so you cannot establish its quality by reading the code. A change that looks harmless — a new prompt, a swapped model, an added tool — can quietly make the agent worse on cases you do not happen to try by hand. Spot-checking a few runs does not catch this, because the failures are spread across inputs you did not test.
Evaluation makes quality measurable and repeatable. By running the agent over a fixed set of cases and scoring the results, you get a number that can be compared before and after a change, so a regression shows up as a drop rather than as a surprise in production. It also turns vague goals into concrete targets: deciding what to measure forces you to define what good behavior actually means for this agent. That measured baseline is what lets a team iterate with confidence instead of guessing.
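As a minimal sketch of how a regression shows up as a drop, assume each version of the agent has been run over the same cases and each run was reduced to a pass or fail. The names `pass_rate` and `is_regression` and the 0.02 tolerance are hypothetical, chosen only for illustration; the tolerance is there because a non-deterministic agent will vary a little between runs even when nothing has changed.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


def is_regression(baseline: list[bool], candidate: list[bool],
                  tolerance: float = 0.02) -> bool:
    """Flag the candidate when its pass rate falls more than `tolerance` below the baseline."""
    return pass_rate(baseline) - pass_rate(candidate) > tolerance


# Hypothetical results for the same five cases before and after a prompt change.
baseline = [True, True, True, True, False]    # 4 of 5 cases pass
candidate = [True, True, False, True, False]  # 3 of 5 cases pass

if is_regression(baseline, candidate):
    print(f"regression: {pass_rate(baseline):.2f} -> {pass_rate(candidate):.2f}")
```

A check like this could run on every change, so the drop is caught before release rather than discovered in production.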
How it works
Evaluation runs the agent against known cases and scores the outcomes:
- Assemble a dataset of representative inputs, ideally with expected outcomes or reference answers where they exist.
- Run the agent over each case, capturing not just the final output but the steps it took to get there.
- Score each run against the chosen metrics — task success, tool-call accuracy, faithfulness to retrieved evidence, latency, or others suited to the task.
- Aggregate the scores into measures you can track over time and compare across versions.
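The four steps above can be sketched as a small harness. This is an illustration under stated assumptions, not a reference implementation: `Case`, `Run`, `evaluate`, `toy_agent`, and `exact_match` are hypothetical names, and the toy agent is a lookup table standing in for a real model-backed agent.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class Case:
    input: str
    expected: str  # reference answer for this case


@dataclass
class Run:
    output: str
    steps: list[str] = field(default_factory=list)  # recorded steps, e.g. tool calls


def evaluate(agent: Callable[[str], Run],
             dataset: list[Case],
             scorer: Callable[[Run, Case], float]) -> float:
    """Run the agent over every case, score each run, and return the mean score."""
    scores = [scorer(agent(case.input), case) for case in dataset]
    return mean(scores) if scores else 0.0


def toy_agent(question: str) -> Run:
    """Stand-in agent for illustration; a real agent would call a model and tools."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return Run(output=answers.get(question, "unknown"),
               steps=[f"lookup({question!r})"])


def exact_match(run: Run, case: Case) -> float:
    """Task-success metric for outputs with a single right answer."""
    return 1.0 if run.output.strip() == case.expected else 0.0


dataset = [Case("2+2", "4"), Case("capital of France", "Paris"), Case("3*3", "9")]
score = evaluate(toy_agent, dataset, exact_match)
print(f"task success: {score:.2f}")
```

Keeping the scorer as a parameter means the same loop can be reused for other metrics, and returning the full `Run` (not just the output) leaves the recorded steps available for step-level scoring.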
Scoring uses whatever fits the metric: exact matching or assertions for outputs with a clear right answer, and an LLM-as-judge — a model scoring outputs against a rubric — for open-ended responses that no single string can capture.
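The two scoring styles might look like the following sketch. Everything here is hypothetical: the rubric text, the function names, and the convention that the judge replies with `1` or `0`. The judge is passed in as a plain callable so that no particular model API is assumed; in practice it would wrap a call to whichever model does the judging.

```python
from typing import Callable

# Hypothetical rubric for a faithfulness check.
RUBRIC = "Score 1 if the answer is fully supported by the evidence, otherwise 0."


def score_exact(output: str, expected: str) -> float:
    """Deterministic scorer: case-insensitive exact match after trimming whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def score_with_judge(output: str, evidence: str,
                     judge: Callable[[str], str]) -> float:
    """Rubric scorer: build a prompt, ask the judge, and parse its verdict."""
    prompt = (
        f"{RUBRIC}\n\nEvidence:\n{evidence}\n\nAnswer:\n{output}\n\n"
        "Reply with 1 or 0 only."
    )
    verdict = judge(prompt).strip()
    if verdict not in {"0", "1"}:
        # Fail loudly rather than silently counting a malformed verdict as a score.
        raise ValueError(f"unparseable judge verdict: {verdict!r}")
    return float(verdict)


def stub_judge(prompt: str) -> str:
    """Stub standing in for a model call; always approves."""
    return "1"


print(score_exact(" Paris ", "paris"))
print(score_with_judge("Paris is the capital.", "Paris is the capital of France.", stub_judge))
```

Rejecting verdicts that do not parse is a deliberate choice in this sketch: a judge is itself a model, so its replies need validating before they are aggregated.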
Evaluation vs. observability
Evaluation and observability are complementary. Observability shows what happened on a given run (the steps, the tool calls, the timings) and is essential for diagnosing a specific failure. Evaluation judges how good the agent's behavior is against defined criteria, typically aggregated across many runs, and is what tells you whether the agent is improving. Observability explains an individual case; evaluation measures the population.
In practice
A durable, observable runtime captures the full trace of every run, which gives evaluation more to score than the final answer alone: the recorded steps let you grade tool-call accuracy and the reasoning loop, not just the output. The traces already collected through tracing become evaluation inputs, and evaluation findings show where guardrails are needed. For more on testing and evaluating agents, see testing.
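To illustrate grading from a trace, here is one possible definition of tool-call accuracy, assuming a trace is a list of step dictionaries with a `type` field and, for tool calls, a `tool` field. The trace shape, the function names, and the position-by-position metric are all assumptions made for this sketch; other definitions (for example, ignoring order or also comparing arguments) are equally reasonable.

```python
def called_tools(trace: list[dict]) -> list[str]:
    """Extract the tool names, in order, from the tool-call steps of a trace."""
    return [step["tool"] for step in trace if step.get("type") == "tool_call"]


def tool_call_accuracy(actual: list[str], expected: list[str]) -> float:
    """Position-by-position agreement between actual and expected tool calls.

    Missing or extra calls count against the score because the denominator
    is the longer of the two sequences.
    """
    if not actual and not expected:
        return 1.0  # nothing was expected and nothing was called
    matches = sum(a == e for a, e in zip(actual, expected))
    return matches / max(len(actual), len(expected))


# Hypothetical trace: two tool calls with a model step in between.
trace = [
    {"type": "tool_call", "tool": "search"},
    {"type": "model", "text": "The search result contains the figures needed."},
    {"type": "tool_call", "tool": "calculator"},
]
expected = ["search", "calculator", "send_email"]

accuracy = tool_call_accuracy(called_tools(trace), expected)
print(f"tool-call accuracy: {accuracy:.2f}")
```

Here the agent made the first two expected calls but never called the third, so the final answer could look plausible while the step-level score still exposes the gap.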