Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
AI Summary
This paper analyzes why LLM agents fail at cloud Root Cause Analysis (RCA) and proposes improvements to the agent architecture.
Key Contributions
- Identifies 12 pitfall types behind LLM agent failures in cloud RCA
- Shows experimentally that general model capability is not the main cause of RCA failures
- Demonstrates that improving the agent architecture raises RCA accuracy more effectively than prompt engineering
Methodology
Runs multiple LLM models on the OpenRCA benchmark and analyzes agent failure modes across intra-agent reasoning, inter-agent communication, and agent-environment interaction.
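The kind of per-model failure analysis described here, tagging each agent run with the pitfall types observed in its trace and comparing rates across models, could be tallied roughly as follows. This is a minimal illustrative sketch; the record fields, pitfall labels, and toy data are assumptions for illustration, not artifacts from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical record of one benchmark run; field names are illustrative.
@dataclass
class AgentRun:
    model: str
    pitfalls: list = field(default_factory=list)  # pitfall labels seen in the trace

# Toy stand-ins for annotated runs from the OpenRCA benchmark.
runs = [
    AgentRun("model-a", ["hallucinated_interpretation"]),
    AgentRun("model-a", ["incomplete_exploration", "hallucinated_interpretation"]),
    AgentRun("model-b", ["incomplete_exploration"]),
    AgentRun("model-b", []),  # successful run, no pitfalls observed
]

def pitfall_rates(runs):
    """Fraction of runs in which each pitfall type appears, per model."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)
    rates = {}
    for model, model_runs in by_model.items():
        counts = Counter(p for r in model_runs for p in set(r.pitfalls))
        rates[model] = {p: c / len(model_runs) for p, c in counts.items()}
    return rates

print(pitfall_rates(runs))
```

Comparing these per-model rate tables is what lets the paper argue that a pitfall appearing at similar rates across all capability tiers points to the shared agent architecture rather than any single model.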
Original Abstract
Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent's reasoning failed. This paper presents a process level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.