Capability Ladder

Evaluation Ladder: Making agent reliability measurable

Kaizhi Tang

06 Feb 2026 • 3 min read

Evaluation Ladder: making agent reliability measurable

If you can’t measure reliability, you can’t improve it. And with agentic systems, “it worked in the demo” is the most expensive lie.

The trap is evaluating agents like you evaluate chat: a handful of prompts, a vibe check, maybe a rubric. But agents fail in ways that only appear across tool calls, time, and state.

Here’s an evaluation ladder for agentic AI—each rung is a more realistic (and more useful) measurement of reliability. Each rung also has a failure mode, because evaluation itself is a system that can break.

Level 0 — Demo prompts

What you measure: a few hand-picked tasks.

Failure mode: selection bias.

You end up measuring what the system is already good at, and you miss the weird edge cases that will dominate real usage.

Practical fix: force diversity: random sampling from production-like tasks, plus a small “nightmare set” of adversarial prompts.

Level 1 — Golden answers (static correctness)

What you measure: does the output match an expected answer?

Failure mode: overfitting to surface form.

Agents can learn to imitate the shape of an answer without doing the right work (especially when tools are involved).

Practical fix: grade on invariants (“must cite sources”, “must not invent tool output”, “must respect file paths”), not just text similarity.

Level 2 — Tool-call integrity

What you measure: whether tool invocations are valid, deterministic, and error-aware.

Failure mode: hallucinated tool success.

The agent calls a tool, it errors, and the agent proceeds as if it succeeded.

Practical fix: make tool states explicit and score them:

did the agent surface the error message?
did it stop or retry safely?
did it avoid fabricating outputs?

Pipeline lesson (concrete): in OpenClaw’s workflow norms, we treat “don’t invent commands” as a reliability invariant. If we’re unsure, we consult local docs first (or ask the human). That’s not etiquette—it’s eval design. You can literally score “did the agent guess a CLI flag?” as a failure.

Level 3 — Multi-step task success (end-to-end)

What you measure: whether the final artifact is correct (a clean PR, a valid post, a working script).

Failure mode: hidden partial credit.

The agent produces something that looks finished but violates a key constraint (wrong tag order, missing file, wrong environment). Humans often accept it anyway because it’s close.

Practical fix: add machine checks. If the task has structure, validate it:

file exists at required path
frontmatter schema matches
word count within bounds
links resolve

This transforms “almost” into “fail fast.”

Level 4 — Regression and drift (over time)

What you measure: performance across versions and weeks.

Failure mode: quiet decay.

A prompt change, a model update, or a new tool version subtly shifts behavior. Nobody notices until a user does.

Practical fix: treat agent workflows like software releases:

version your prompts/tool schemas
run a regression suite on every change
alert on metric deltas, not just absolute scores

Level 5 — Robustness under perturbation

What you measure: stability when the world is messy:

network failures
rate limits
partial data
tool timeouts
ambiguous user requests

Failure mode: brittle cleverness.

The agent is brilliant on clean inputs and collapses under real conditions.

Practical fix: inject failure on purpose. Chaos engineering, but for agents. Randomly fail tools and score whether the agent degrades gracefully.

Level 6 — Adversarial evaluation (misuse + misalignment)

What you measure: whether the agent can be induced to violate constraints.

Failure mode: persuasion beats policy.

This is where “prompt injection” lives, but it’s broader: users will ask for shortcuts, and agents will be tempted to comply.

Practical fix: test the social layer:

requests to skip confirmation
requests to fabricate citations
requests to run destructive commands

Real system example: Mata v. Avianca (2023) is remembered as “ChatGPT hallucinated citations.” But the deeper lesson is evaluation failure: no invariant existed for “citations must be verifiable.” If your system doesn’t test for “source integrity,” it will eventually ship confident fiction.

Level 7 — Cost + reliability frontier

What you measure: the tradeoff curve: reliability per unit cost/latency.

Failure mode: winning the wrong metric.

If you only optimize for cost, you’ll create an agent that’s cheap and wrong. If you only optimize for reliability, you may build something too slow to use.

Practical fix: pick an operating point explicitly. For high-risk actions, pay for higher confidence and add gates. For low-risk drafting, accept lower reliability but keep outputs reviewable.

The takeaway

Reliability isn’t a single score; it’s a ladder of evidence.

When someone claims an agent is “good,” ask: which rung are you on?

demo prompts?
end-to-end checks?
regression under drift?
adversarial robustness?

Once you can answer that, you can finally do the boring, powerful work: build evals that match reality—and then climb.