Agentic AI is a reliability problem

A capability ladder for agentic AI: what each level enables, what breaks, and what upgrades make reliability real.

Most debates about agentic AI are framed as intelligence questions: Can it plan? Can it reason? Can it use tools?

But in practice, the difference between a toy agent and a useful agent is simpler:

Can you trust it to behave the same way twice?

Agency is not just capability. It’s capability under repetition.

Below is a ladder I use to think about agentic systems—from “clever autocomplete” to “operational coworker”—with the failure mode that kills each level.

Level 0 — Text completion

What it does: predicts the next token.

Failure mode: confident nonsense.

At this level, the model is impressive but not accountable. There’s no stable interface with the world.

Level 1 — Single-shot tool use

What it does: calls one tool when prompted (search, calculator, API call).

Failure mode: silent tool failure.

If the tool errors, or returns partial data, the model often “fills in” the missing pieces. Reliability begins with respecting error states.

Level 2 — Multi-step plans

What it does: decomposes a task into steps.

Failure mode: plan drift.

The plan looks great at the start, then slowly mutates. The agent forgets constraints, changes objectives midstream, or optimizes for easy progress.

The fix isn’t “better planning.” It’s persistent state + constraint checking.
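Concretely, "persistent state + constraint checking" can be as simple as re-validating invariants after every step and persisting state atomically. A sketch under assumed state keys (`objective`, `steps_done`, etc. are hypothetical):

```python
import json
import os
import tempfile

CONSTRAINTS = [
    # Hypothetical invariants re-checked after every step, so the plan
    # can't drift away from them mid-run.
    lambda state: state["objective"] == state["original_objective"],
    lambda state: state["steps_done"] <= state["max_steps"],
]

def run_step(state: dict, step_fn) -> dict:
    """Apply one step, check invariants, and persist the new state;
    a violated constraint halts the run instead of silently continuing."""
    new_state = step_fn(dict(state))
    for check in CONSTRAINTS:
        if not check(new_state):
            raise RuntimeError("constraint violated; plan drift detected")
    # Persist atomically: write to a temp file, then rename over the old one.
    fd, tmp = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        json.dump(new_state, f)
    os.replace(tmp, "agent_state.json")
    return new_state
```

The constraints are checked on every step, not just at plan time, which is exactly what catches an agent that quietly changes objectives midstream.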

Level 3 — Toolchains with memory

What it does: executes sequences of tools, stores intermediate artifacts, resumes.

Failure mode: state corruption.

If you can’t trust the state, you can’t trust the agent. A corrupted memory file, a partial database write, a stale cache—suddenly the agent is “acting” on a hallucinated world.

This is where agents become software engineering, not prompting.

Level 4 — Self-checking and rollback

What it does: verifies outputs, detects anomalies, retries safely.

Failure mode: false confidence in self-evals.

Agents that grade themselves often become persuasive rather than correct. You need external checks: unit tests, invariants, human approval gates.
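An external check can be as plain as a list of predicates the output must pass before it's accepted; the model never grades itself. A sketch for a hypothetical "write a changelog entry" task (the specific checks are assumptions for illustration):

```python
def external_gate(candidate: str, checks) -> bool:
    """Accept an agent's output only if every external check passes.
    The checks live outside the model, so they can't be argued with."""
    return all(check(candidate) for check in checks)

# Hypothetical invariant checks for a "write a changelog entry" task.
checks = [
    lambda text: len(text) > 0,
    lambda text: text.startswith("- "),   # format invariant
    lambda text: "TODO" not in text,      # completeness invariant
]
```

The same shape generalizes to unit tests and human approval gates: a gate that returns `False` blocks the output, no matter how persuasive the agent's self-assessment was.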

Level 5 — Operational coworker

What it does: runs daily workflows with bounded autonomy.

Failure mode: incentive misalignment.

At scale, the agent will optimize for the metric it’s rewarded on: speed over accuracy, output volume over quality, “looks done” over “is correct.”

The cure is boring: explicit goals, limited permissions, audit logs, and deliberate friction.
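Two of those boring cures, limited permissions and audit logs, compose naturally: every action is checked against an explicit allowlist and recorded before it runs. A sketch with assumed action names (`read_file`, `search`) standing in for real tools:

```python
import datetime

ALLOWED_ACTIONS = {"read_file", "search"}  # hypothetical bounded permission set
AUDIT_LOG: list[dict] = []

def execute(action: str, arg: str) -> str:
    """Check the allowlist and log the attempt before running anything;
    denied actions are logged too, never silently dropped."""
    entry = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "action": action,
        "arg": arg,
        "allowed": action in ALLOWED_ACTIONS,
    }
    AUDIT_LOG.append(entry)
    if not entry["allowed"]:
        raise PermissionError(f"action {action!r} not permitted")
    return f"ran {action} on {arg}"  # stand-in for the real tool call
```

Logging the denial, not just the success, is the deliberate friction: when the agent starts optimizing for "looks done," the audit trail shows what it tried to do.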

The takeaway

If you want agentic AI that works, treat it like production software:

  • define invariants
  • design for failure
  • log everything
  • add rollback
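The last item deserves a sketch of its own: rollback means snapshotting state before each step and restoring the snapshot when validation fails, so corruption never carries forward. (The function names here are illustrative, not a prescribed API.)

```python
import copy

def step_with_rollback(state: dict, step_fn, validate) -> dict:
    """Snapshot state before each step; if the step fails or its result
    fails validation, restore the snapshot instead of carrying it forward."""
    snapshot = copy.deepcopy(state)
    try:
        new_state = step_fn(state)
        if not validate(new_state):
            raise ValueError("post-step validation failed")
        return new_state
    except Exception:
        return snapshot  # roll back to the last known-good state
```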

The frontier isn’t “more autonomy.” It’s more reliability per unit autonomy.