CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments
AI Summary
CirrusBench evaluates LLM-based agents in real-world cloud service environments, with a focus on resolution efficiency and user experience.
Key Contributions
- Proposes CirrusBench, an evaluation framework built on real-world cloud service tickets.
- Introduces customer-centric metrics, such as the Normalized Efficiency Index and Multi-Turn Latency.
- Reveals that existing LLMs fall short on efficiency in complex, realistic scenarios.
Methodology
Constructs an evaluation environment from real-world cloud service tickets and assesses LLM-based agent performance using customer-centric metrics.
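To make the customer-centric metrics concrete, here is a minimal sketch of how a Normalized Efficiency Index and Multi-Turn Latency could be computed per ticket. The formulas are assumptions for illustration only (NEI as the ratio of reference turns to actual agent turns, zeroed when unresolved; latency as mean seconds per turn); the paper's actual definitions may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One resolved-or-failed ticket interaction (hypothetical schema)."""
    agent_turns: int                      # turns the agent actually took
    reference_turns: int                  # turns in a reference resolution
    turn_latencies: list = field(default_factory=list)  # seconds per agent turn
    resolved: bool = False                # whether the ticket was resolved

def normalized_efficiency_index(ep: Episode) -> float:
    """Assumed NEI: reference turns / actual turns, capped at 1.0
    and zeroed for unresolved tickets. Not the paper's exact formula."""
    if not ep.resolved or ep.agent_turns == 0:
        return 0.0
    return min(1.0, ep.reference_turns / ep.agent_turns)

def multi_turn_latency(ep: Episode) -> float:
    """Assumed MTL: mean wall-clock latency per agent turn."""
    if not ep.turn_latencies:
        return 0.0
    return sum(ep.turn_latencies) / len(ep.turn_latencies)

ep = Episode(agent_turns=8, reference_turns=4,
             turn_latencies=[2.0, 3.0, 1.0, 2.0], resolved=True)
print(normalized_efficiency_index(ep))  # 0.5
print(multi_turn_latency(ep))           # 2.0
```

Averaging these per-episode scores across a ticket suite would yield benchmark-level efficiency numbers, which is the kind of aggregation a framework like this would report.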
Original Abstract
The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through measures such as the Normalized Efficiency Index and Multi-Turn Latency that explicitly capture resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. The CirrusBench evaluation framework is released at: https://github.com/CirrusAI