AI Agents Relevance: 9/10

Near-Miss: Latent Policy Failure Detection in Agentic Workflows

Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor
arXiv: 2603.29665v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

This paper proposes a new method for detecting latent policy failures in agentic workflows, identifying cases where required policy checks were skipped even though the final outcome is correct.

Key Contributions

  • Introduces the concepts of "near-misses" and "latent failures" to describe cases where an agent bypasses required policy checks yet still reaches a correct outcome.
  • Proposes a new metric that determines whether an agent's tool-calling decisions were sufficiently informed, thereby detecting latent policy failures.
  • Validates the approach on the τ²-verified Airlines benchmark, revealing a blind spot in current evaluation methodologies.
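The latent-failure metric described above can be sketched as a rate over trajectories that contain state-mutating tool calls. The sketch below is illustrative only: the data structures, the `REQUIRED_CHECKS` mapping, and the tool names are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    mutating: bool = False  # does this call update system state?

@dataclass
class Trajectory:
    calls: list            # ordered ToolCall sequence from the conversation
    outcome_correct: bool  # final state matches the ground truth

# Hypothetical policy table: checks that must precede each mutating tool.
REQUIRED_CHECKS = {"update_reservation": {"check_baggage_policy"}}

def is_latent_failure(traj):
    """Outcome is correct, but a required check was skipped before a mutation."""
    if not traj.outcome_correct:
        return False  # an explicit failure, not a near-miss
    seen = set()
    for call in traj.calls:
        if call.mutating and not REQUIRED_CHECKS.get(call.name, set()) <= seen:
            return True
        seen.add(call.name)
    return False

def latent_failure_rate(trajs):
    """Fraction of mutating-call trajectories that are near-misses."""
    mutating = [t for t in trajs if any(c.mutating for c in t.calls)]
    return sum(is_latent_failure(t) for t in mutating) / len(mutating) if mutating else 0.0
```

The abstract reports this rate at 8-17% across the LLM agents evaluated, even when final outcomes match the ground truth.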

Methodology

The method builds on the ToolGuard framework, which converts natural-language policies into executable guard code, and analyzes agent trajectories to assess whether the agent's tool-calling decisions were sufficiently informed.
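A guard-style check of this kind can be sketched as a predicate over the prefix of tool calls preceding each guarded call. This is a minimal illustration under assumed names (`get_reservation_details`, `get_fare_rules`, `change_flight` are hypothetical), not ToolGuard's actual generated code or API.

```python
def guard_change_flight(prefix_calls):
    """Hypothetical guard for a policy like: 'Before changing a flight, the
    agent must retrieve the booking details and the fare rules.'
    Returns True iff the decision would be sufficiently informed."""
    return {"get_reservation_details", "get_fare_rules"} <= set(prefix_calls)

# Assumed mapping from guarded tool name to its guard predicate.
GUARDS = {"change_flight": guard_change_flight}

def evaluate_trajectory(tool_calls):
    """Check each guarded call against the calls that preceded it,
    returning (tool_name, was_informed) pairs."""
    results, prefix = [], []
    for name in tool_calls:
        guard = GUARDS.get(name)
        if guard is not None:
            results.append((name, guard(prefix)))
        prefix.append(name)
    return results
```

A trajectory that calls `change_flight` without the two lookups would be flagged as uninformed here even if the resulting system state happens to be correct, which is exactly the near-miss case the paper targets.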

Original Abstract

Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as $\textit{near-misses}$ or $\textit{latent failures}$. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether an agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the $\tau^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.

Tags

AI Agents Policy Compliance LLM Evaluation Agent Workflows

arXiv Categories

cs.CL