What are AI Agent Guardrails?

Also called: AI guardrails, agent guardrails

Updated June 24, 2026

Quick Definition

Guardrails are validation checks placed around an agent that enforce rules on what it receives and what it produces. They inspect inputs before the agent acts and outputs before they are returned, and when a check fails they intervene — blocking, correcting, or escalating — so the agent’s non-deterministic behavior stays within defined bounds.

Why guardrails matter

An agent chooses its own actions at runtime, which means it can do things its author did not intend: act on a malicious instruction hidden in an input, call a tool with unsafe arguments, return content that violates policy, or produce output in a shape the next system cannot consume. Because the model’s behavior is not fully predictable, you cannot rely on the prompt alone to keep every run in line.

Guardrails address this by making safety a property of the system rather than a hope about the model. They turn implicit expectations — this field must be valid, this category of request must be refused, this output must match a schema — into explicit checks that run every time. That matters most in production, where an agent acts on real systems and a single bad action can have real consequences. Guardrails bound the blast radius without requiring the model itself to be perfect.
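As a minimal sketch of turning the three implicit expectations above into explicit checks, the Python below writes each one as a small function. The policy list, the output schema, and every function name are hypothetical and chosen only for illustration; a real deployment would use its own rules.

```python
import re
from typing import Any

# Hypothetical policy and schema, invented for this example.
REFUSED_CATEGORIES = {"credential_request"}
REQUIRED_OUTPUT_FIELDS = {"answer": str, "confidence": float}


def field_is_valid(email: str) -> bool:
    """'This field must be valid': a deliberately loose email shape check."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None


def request_is_allowed(category: str) -> bool:
    """'This category of request must be refused.'"""
    return category not in REFUSED_CATEGORIES


def output_matches_schema(output: Any) -> bool:
    """'This output must match a schema': exact keys, expected types."""
    if not isinstance(output, dict):
        return False
    if set(output) != set(REQUIRED_OUTPUT_FIELDS):
        return False
    return all(
        isinstance(output[key], expected)
        for key, expected in REQUIRED_OUTPUT_FIELDS.items()
    )
```

Each function returns a plain boolean, so the same check can run on every invocation regardless of what the model did.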

How it works

Guardrails wrap the agent at its edges and run as part of each step:

  1. Input validation runs before the agent acts, inspecting the prompt or a tool argument and rejecting or sanitizing anything that breaks a rule.
  2. Output validation runs after the agent produces a result, checking it against constraints such as format, content policy, or required fields.
  3. When a check passes, the run proceeds unchanged.
  4. When a check fails, a configured failure mode decides what happens next — commonly retry the step, raise an error, attempt to fix the result, or escalate to a human.

Because the checks are explicit and run on every invocation, they apply uniformly regardless of which path the model chose, and each pass or failure is recorded as part of the run.
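The steps above can be sketched as a thin wrapper around an agent. This is an illustrative sketch, not any particular runtime's API: `GuardedAgent`, `GuardrailViolation`, the two example checks, and the stand-in `echo_agent` are all assumed names, and the only failure mode shown is raising an error.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# A check returns None on pass, or a short reason string on failure.
Check = Callable[[str], Optional[str]]


class GuardrailViolation(Exception):
    """Raised when an input or output check fails."""


@dataclass
class GuardedAgent:
    agent: Callable[[str], str]
    input_checks: List[Check] = field(default_factory=list)
    output_checks: List[Check] = field(default_factory=list)
    log: List[dict] = field(default_factory=list)  # every pass and failure

    def _run_checks(self, stage: str, checks: List[Check], value: str) -> None:
        for check in checks:
            reason = check(value)
            self.log.append({
                "stage": stage,
                "check": check.__name__,
                "passed": reason is None,
                "reason": reason,
            })
            if reason is not None:
                raise GuardrailViolation(f"{stage}:{check.__name__}: {reason}")

    def run(self, prompt: str) -> str:
        self._run_checks("input", self.input_checks, prompt)   # step 1
        result = self.agent(prompt)
        self._run_checks("output", self.output_checks, result)  # step 2
        return result                                            # step 3


def no_injection_phrase(prompt: str) -> Optional[str]:
    # Naive keyword match, for illustration only.
    if "ignore previous instructions" in prompt.lower():
        return "possible prompt injection"
    return None


def not_empty(output: str) -> Optional[str]:
    return None if output.strip() else "empty output"


def echo_agent(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"Summary: {prompt}"
```

Because the checks live in the wrapper rather than in the prompt, they run the same way whichever path the agent takes, and `log` holds a record of each result.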

Guardrails vs. evaluation

Guardrails and evaluation both concern quality but operate at different times. A guardrail runs inline during a live run and can change its course in the moment, so it is an enforcement mechanism. Evaluation runs after the fact, usually across many runs, to measure how well the agent performs, so it is a measurement mechanism. The two are complementary: evaluation reveals where an agent tends to go wrong, and guardrails enforce limits on those behaviors in production.
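To make the timing difference concrete, here is a small sketch in which the same rule is used two ways. The length limit and the names `enforce` and `pass_rate` are assumptions made for this example.

```python
from typing import Callable, Iterable

MAX_CHARS = 20  # hypothetical limit, for illustration only


def within_limit(answer: str) -> bool:
    return len(answer) <= MAX_CHARS


def enforce(answer: str) -> str:
    """Guardrail: runs inline and changes what this one run returns."""
    return answer if within_limit(answer) else "[blocked: answer too long]"


def pass_rate(answers: Iterable[str], grader: Callable[[str], bool]) -> float:
    """Evaluation: runs afterwards over many recorded answers and only measures."""
    recorded = list(answers)
    if not recorded:
        return 0.0
    return sum(1 for answer in recorded if grader(answer)) / len(recorded)
```

`enforce` alters the outcome of a live run, while `pass_rate` leaves past runs untouched and reports how often the rule held.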

In practice

A durable, observable runtime runs guardrails as part of each step and records every pass and failure, so violations are visible rather than silent. When a check fails, the configured mode can retry, raise, fix, or hand off to a person, which connects guardrails to human-in-the-loop review and approval gates. They are especially important around tool use, where an unchecked argument becomes a real action. For input and output validation and the available failure modes, see the guardrails documentation.
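A sketch of a guardrail on a tool argument with a human approval gate might look like the following. The refund tool, the approval threshold, and the `approve` callback are hypothetical; they stand in for whatever tool and review process a real system uses.

```python
from dataclasses import dataclass
from typing import Callable

AUTO_APPROVE_LIMIT = 100.0  # hypothetical threshold


@dataclass
class ToolCall:
    tool: str
    args: dict


def check_refund_args(call: ToolCall) -> str:
    """Return 'allow', 'escalate', or 'block' for a proposed refund."""
    amount = call.args.get("amount")
    is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
    if not is_number or amount <= 0:
        return "block"
    if amount > AUTO_APPROVE_LIMIT:
        return "escalate"
    return "allow"


def execute(call: ToolCall,
            tool: Callable[..., str],
            approve: Callable[[ToolCall], bool]) -> str:
    """Check the arguments first; the tool runs only if the check allows it."""
    decision = check_refund_args(call)
    if decision == "block":
        return "blocked"
    if decision == "escalate" and not approve(call):
        return "rejected by reviewer"
    return tool(**call.args)


def issue_refund(amount: float) -> str:
    """Stand-in for a tool with real side effects."""
    return f"refunded {amount:.2f}"
```

The check sits between the agent's proposed call and the tool itself, so an unsafe argument is stopped or reviewed before it turns into an action.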

Frequently asked questions

What is the difference between input and output guardrails?

An input guardrail checks what goes into the agent — the user prompt or a tool argument — before it acts, catching things like injection attempts or disallowed requests. An output guardrail checks what the agent produces before it is returned, catching unsafe, malformed, or off-policy results.

What is the difference between guardrails and evaluation?

Guardrails run inline during a live run and can block or change behavior in the moment. Evaluation measures quality after the fact, usually over many runs, to judge how well the agent performs. Guardrails enforce; evaluation grades.

What happens when a guardrail fails?

Depending on the configured mode, the run can retry the step, raise an error and stop, attempt to fix the output to satisfy the rule, or escalate to a human for a decision. The choice depends on how recoverable the violation is.
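The four failure modes can be sketched as a single dispatch function. This is one possible shape under stated assumptions: `OnFail`, `run_step`, `GuardrailError`, and the `fix` and `escalate` callbacks are names invented for this example, not a specific product's interface.

```python
from enum import Enum
from typing import Callable, Optional


class OnFail(Enum):
    RETRY = "retry"
    RAISE = "raise"
    FIX = "fix"
    ESCALATE = "escalate"


class GuardrailError(Exception):
    """Raised when a check fails and cannot be recovered."""


def run_step(step: Callable[[], str],
             check: Callable[[str], bool],
             on_fail: OnFail,
             *,
             max_retries: int = 2,
             fix: Optional[Callable[[str], str]] = None,
             escalate: Optional[Callable[[str], str]] = None) -> str:
    output = step()
    if check(output):
        return output  # check passed: proceed unchanged

    if on_fail is OnFail.RETRY:
        for _ in range(max_retries):
            output = step()
            if check(output):
                return output
        raise GuardrailError("check still failing after retries")

    if on_fail is OnFail.FIX:
        if fix is None:
            raise ValueError("FIX mode requires a fix function")
        fixed = fix(output)
        if check(fixed):  # re-validate the repaired output
            return fixed
        raise GuardrailError("fix did not satisfy the check")

    if on_fail is OnFail.ESCALATE:
        if escalate is None:
            raise ValueError("ESCALATE mode requires an escalate function")
        return escalate(output)  # a person decides what to return

    raise GuardrailError("check failed")  # OnFail.RAISE
```

Retry suits transient failures, fix suits violations that can be repaired mechanically, and raise or escalate suit cases where continuing automatically would be unsafe.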