AI Signals & Reality Checks: Model Risk Budgets (Safety as an SLO)
Feb 24, 2026
Signal
“Evaluation” is turning into a continuous production discipline—because teams are starting to treat AI risk like an SLO, not a policy memo.
In the last year, a lot of AI programs have followed the same arc:
- A model passes an offline benchmark.
- It ships behind a feature flag.
- A week later, something weird happens in the real world (a subtle hallucination, a policy violation, a customer trust incident, or an automation that silently drifted).
The natural response is “we need better evals.” But what’s emerging now is more specific: model risk budgets.
A risk budget is not “be safe.” It’s a statement like:
- In this workflow, we can tolerate at most 1/10,000 runs producing a materially wrong write.
- We can tolerate 1/100 runs producing a low-stakes factual error—if it’s clearly labeled and never auto-written to a system of record.
- We can tolerate 1/1,000 runs triggering a policy boundary (PII exposure, disallowed content)—but only if the boundary is blocked, logged, and routed to review.
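A budget stated this way is just a tolerated failure rate per event class, which makes it trivially checkable in code. The sketch below is illustrative, not a standard schema; the class and function names (`RiskBudget`, `within_budget`) are assumptions, but the three thresholds come straight from the examples above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RiskBudget:
    """Maximum tolerated failures per run for one event class.
    Names here are illustrative, not a standard schema."""
    event: str       # e.g. "material_wrong_write"
    max_rate: float  # tolerated failures per run (1/10_000 -> 1e-4)

def within_budget(budget: RiskBudget, failures: int, runs: int) -> bool:
    """True if the observed failure rate stays inside the budget."""
    if runs == 0:
        return True  # no evidence yet; callers may prefer a stricter default
    return failures / runs <= budget.max_rate

# The three example budgets from the text:
budgets = [
    RiskBudget("material_wrong_write", 1 / 10_000),
    RiskBudget("labeled_low_stakes_error", 1 / 100),
    RiskBudget("blocked_policy_boundary", 1 / 1_000),
]

print(within_budget(budgets[0], failures=2, runs=10_000))  # 2e-4 > 1e-4 -> False
```

The point is not the arithmetic; it is that once the budget is a number, "are we within budget?" becomes a query against production telemetry rather than a debate.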
Once teams start talking this way, product and engineering choices change:
- Guardrails become measurable. You track “block rate,” “escalation rate,” and “unsafe completion rate,” not just “accuracy.”
- Automation gets tiered. Read-only actions and draft outputs can have a looser budget; writes and irreversible actions need a much tighter one.
- Rollbacks become first-class. If risk spikes, you need a clean path to downgrade autonomy (auto → approve → draft) without redeploying the world.
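The downgrade path (auto → approve → draft) can be sketched as a tiny state machine keyed off a risk metric. Tier names and thresholds below are assumptions drawn from the text, not an established API:

```python
# Hedged sketch: stepping autonomy down (auto -> approve -> draft) when a
# risk metric breaches its budget. Tier names are from the text; the
# function names are illustrative.
TIERS = ["draft", "approve", "auto"]  # least to most autonomous

def downgrade(tier: str) -> str:
    """One step less autonomy; 'draft' is the floor."""
    i = TIERS.index(tier)
    return TIERS[max(i - 1, 0)]

def next_tier(tier: str, unsafe_rate: float, budget: float) -> str:
    """Keep the current tier while within budget; otherwise step down."""
    return tier if unsafe_rate <= budget else downgrade(tier)

print(next_tier("auto", unsafe_rate=3e-4, budget=1e-4))    # "approve"
print(next_tier("approve", unsafe_rate=5e-5, budget=1e-4))  # "approve"
```

Because the downgrade is a config change to a tier value rather than a deploy, it can run automatically from monitoring, which is what "without redeploying the world" requires.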
In other words: the industry is drifting toward a mental model borrowed from reliability engineering.
- Reliability isn’t “never fail.” It’s “fail within a budget, detect quickly, recover cleanly.”
- Model safety in production is starting to look the same: explicit failure budgets, live monitoring, and fast mitigation loops.
Reality check
Risk budgets only work if you can (1) measure failures, (2) assign ownership, and (3) align incentives. Most teams currently have gaps in all three.
Here are the three friction points that show up immediately:
- You can’t budget what you can’t instrument. Offline evals are easy to count. Real-world failures are not—especially when the “failure” is a near miss.
If you want a meaningful risk budget, you need event definitions that can be captured at runtime:
- Write correctness: did the agent write the right value to the right field?
- Evidence integrity: did the output cite the right source snippet / artifact?
- Policy boundaries: did the run attempt something disallowed (even if blocked)?
- User harm proxies: did the user have to undo, re-run, or escalate?
Then you need logging that supports before/after diffs and reason codes (why a block happened, why a human approval was required). Otherwise, “risk” becomes a vague feeling.
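A minimal version of such a log record, with a reason code and a before/after diff, might look like the sketch below. Field names (`reason_code`, `diff`, and the event labels) are illustrative assumptions; the substance is that every block or approval carries a machine-readable "why":

```python
import json
import time

def log_event(event: str, reason_code: str, before, after, blocked: bool) -> str:
    """Serialize one runtime risk event. Field names are illustrative;
    the point is capturing the reason and the before/after diff at
    the moment the event happens, not reconstructing them later."""
    record = {
        "ts": time.time(),
        "event": event,              # e.g. "write_correctness", "policy_boundary"
        "reason_code": reason_code,  # e.g. "FIELD_MISMATCH", "PII_IN_OUTPUT"
        "blocked": blocked,
        "diff": {"before": before, "after": after},
    }
    return json.dumps(record)

line = log_event(
    "write_correctness", "FIELD_MISMATCH",
    before={"status": "open"}, after={"status": "closed"},
    blocked=True,
)
print(line)
```

With records like this, "block rate" and "escalation rate" become aggregations over `reason_code`, and near misses (blocked attempts) are counted instead of disappearing.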
- Ownership breaks when the system is a stack of vendors. A typical production agent involves:
- a foundation model provider,
- an orchestration layer,
- retrieval/search,
- internal tools/APIs,
- and your product UI.
When something goes wrong, the failure can be anywhere in the chain. If you don’t assign ownership per layer (and per metric), risk budgets turn into finger-pointing.
A practical pattern is to name a single “risk owner” for each workflow, and require:
- a documented autonomy tier (draft/approve/auto),
- a rollback trigger (what metrics force a downgrade),
- and a weekly review of top incidents + near misses.
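The risk-owner requirements above are concrete enough to encode as a per-workflow spec. This is a sketch under assumed names (`WorkflowRiskSpec`, `must_downgrade`, the example workflow and metric names are all hypothetical):

```python
from dataclasses import dataclass

@dataclass
class WorkflowRiskSpec:
    """What the text asks each workflow's risk owner to document.
    All field and example names are illustrative."""
    workflow: str
    owner: str
    autonomy_tier: str                   # "draft" | "approve" | "auto"
    rollback_triggers: dict[str, float]  # metric -> threshold forcing a downgrade

spec = WorkflowRiskSpec(
    workflow="refund_agent",        # hypothetical workflow
    owner="payments-oncall",        # a single named risk owner
    autonomy_tier="approve",
    rollback_triggers={
        "unsafe_completion_rate": 1e-3,
        "boundary_attempt_rate": 1e-2,
    },
)

def must_downgrade(spec: WorkflowRiskSpec, metrics: dict[str, float]) -> bool:
    """Any breached trigger forces an autonomy downgrade."""
    return any(metrics.get(m, 0.0) > t for m, t in spec.rollback_triggers.items())

print(must_downgrade(spec, {"unsafe_completion_rate": 2e-3}))  # True
```

Making the spec a checked artifact (in code or config review) is what turns "name a risk owner" from an org-chart gesture into an enforceable contract.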
- If incentives reward shipping, budgets will be ignored. Teams often set “safety targets” that compete with OKRs like activation, retention, and cost reduction. When those collide, risk budgets lose.
What works better is to make risk budgets part of the shipping gate:
- no expansion of autonomy without meeting the budget for N days,
- automatic throttling when boundary attempts spike,
- and post-incident “error bars” that reduce allowed autonomy until confidence is rebuilt.
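The first gate above ("no expansion without meeting the budget for N days") reduces to a check over daily budget results. A minimal sketch, assuming one boolean per day with the most recent day last (function and parameter names are illustrative):

```python
def can_expand_autonomy(daily_within_budget: list[bool], n_days: int) -> bool:
    """Shipping gate from the text: allow an autonomy expansion only if
    the workflow met its risk budget on each of the last n_days.
    daily_within_budget holds one result per day, most recent last."""
    if len(daily_within_budget) < n_days:
        return False  # not enough history counts as not meeting the gate
    return all(daily_within_budget[-n_days:])

print(can_expand_autonomy([True, True, False, True, True], n_days=3))  # False
print(can_expand_autonomy([False, True, True, True], n_days=3))        # True
```

Note the design choice that insufficient history fails the gate: a brand-new workflow has to earn autonomy, which matches the post-incident pattern of reducing allowed autonomy until confidence is rebuilt.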
Bottom line: the next wave of AI maturity is less about discovering new capabilities and more about operationalizing predictable behavior. The teams that win won’t just have strong models—they’ll have risk budgets that are measurable, enforceable, and reversible.