Towards a Science of AI Agent Reliability
AI Summary
The paper proposes twelve metrics for evaluating the reliability of AI agents along four dimensions: consistency, robustness, predictability, and safety.
Main Contributions
- Proposes twelve new metrics for evaluating AI agent reliability
- Decomposes agent reliability along four dimensions: consistency, robustness, predictability, and safety
- Evaluates 14 agentic models, revealing the reliability bottlenecks of existing agents
Methodology
Defines twelve reliability metrics and evaluates 14 agentic models on two benchmarks, analyzing their performance and limitations.
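The paper's exact metric definitions are not reproduced in this summary. As a hypothetical illustration of the consistency dimension (whether an agent behaves the same across repeated runs), one could score each task by whether all of its runs agree, then average over tasks; the function and field names below are assumptions, not the paper's definitions:

```python
from statistics import mean

def task_consistency(runs):
    """Return 1.0 if all repeated runs of a task agree (all succeed
    or all fail), else 0.0. `runs` is a list of success flags."""
    return 1.0 if all(runs) or not any(runs) else 0.0

def consistency_score(results):
    """Average per-task consistency over a benchmark.
    `results` maps task id -> list of success flags across runs."""
    return mean(task_consistency(r) for r in results.values())

demo = {
    "t1": [True, True, True],     # consistent success
    "t2": [False, False, False],  # consistent failure
    "t3": [True, False, True],    # inconsistent across runs
}
print(consistency_score(demo))  # 2 of 3 tasks are consistent
```

A metric like this is orthogonal to accuracy: an agent with a 50% success rate could still be perfectly consistent (always succeeding on the same half of tasks), which is the kind of distinction a single aggregate score hides.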
Original Abstract
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.