Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis
AI Summary
The TED framework improves agent performance through user interaction, automated evaluation, and error diagnosis.
Main Contributions
- Proposes the TED framework, comprising three modules: user interaction, automated evaluation, and error diagnosis
- Introduces an LLM-based automated evaluation method that captures turn efficiency and the agent's intermediate progress
- Provides an automated error analysis tool that uncovers common agent errors and offers actionable improvement suggestions
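The Talk module relies on reusable persona templates parameterized by user expertise. A minimal sketch of what such a template could look like (the template text, function names, and style hints below are illustrative assumptions, not taken from the paper):

```python
# Hypothetical sketch of the "Talk" step: one generic user-persona template,
# specialized into expert and non-expert variants via a style hint.
PERSONA_TEMPLATE = (
    "You are simulating a {expertise} user interacting with an assistant.\n"
    "Your goal: {goal}\n"
    "Behavior: {style_hint}"
)

# Illustrative style hints; the actual templates in the paper may differ.
STYLE_HINTS = {
    "expert": "Use precise domain terminology and state all details upfront.",
    "non-expert": "Use everyday language, omit details until asked, "
                  "and ask clarifying questions when confused.",
}

def build_persona(expertise: str, goal: str) -> str:
    """Fill the generic template for a given expertise level and task goal."""
    return PERSONA_TEMPLATE.format(
        expertise=expertise, goal=goal, style_hint=STYLE_HINTS[expertise]
    )

prompt = build_persona("non-expert", "book a one-way flight to Tokyo")
```

Because the template is generic, the same task goal can be replayed under both expertise levels to compare agent performance.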
Methodology
The framework interacts with the agent through generic user-persona templates, represents task subgoals as natural-language grading notes, evaluates them with an LLM-as-a-judge, and analyzes inconsistencies between the judge and the agent.
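The Evaluate step can be sketched as follows: each subgoal becomes a natural-language grading note, a judge checks each note against the conversation transcript, and the per-note results yield both intermediate progress and overall success. In this sketch a keyword-matching stub stands in for the actual LLM judge; all names and the transcript are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class GradingNote:
    description: str    # natural-language subgoal, e.g. an expected tool call
    keywords: list      # stand-in signal the stub judge looks for

def stub_judge(transcript: str, note: GradingNote) -> bool:
    """Placeholder for LLM-as-a-judge: a real system would prompt an LLM
    with the transcript and the note; here we just match keywords."""
    return all(kw in transcript for kw in note.keywords)

def evaluate(transcript: str, notes: list) -> dict:
    """Score a transcript against all grading notes."""
    passed = [stub_judge(transcript, n) for n in notes]
    return {
        # fraction of subgoals satisfied, regardless of final outcome
        "intermediate_progress": sum(passed) / len(notes),
        # task succeeds only if every grading note is satisfied
        "task_success": all(passed),
    }

notes = [
    GradingNote("Agent calls search_flights with the correct date",
                ["search_flights", "2024-06-01"]),
    GradingNote("Agent confirms the booking with the user",
                ["booking confirmed"]),
]
transcript = "assistant: search_flights(date='2024-06-01') ... booking confirmed."
result = evaluate(transcript, notes)
```

Because the notes are plain natural language, the same evaluation loop works across heterogeneous domains without database lookups or regex matching per task.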
Original Abstract
Agent applications are increasingly adopted to automate workflows across diverse tasks. However, due to the heterogeneous domains they operate in, it is challenging to create a scalable evaluation framework. Prior works each employ their own methods to determine task success, such as database lookups, regex matching, etc., adding complexity to the development of a unified agent evaluation approach. Moreover, they do not systematically account for the user's role or expertise in the interaction, providing incomplete insights into the agent's performance. We argue that effective agent evaluation goes beyond correctness alone, incorporating conversation quality, efficiency, and systematic diagnosis of agent errors. To address this, we introduce the TED framework (Talk, Evaluate, Diagnose). (1) Talk: We leverage reusable, generic expert and non-expert user persona templates for user-agent interaction. (2) Evaluate: We adapt existing datasets by representing subgoals (such as tool signatures and responses) as natural language grading notes, evaluated automatically with LLM-as-a-judge. We propose new metrics that capture both turn efficiency and intermediate progress of the agent, complementing the user-aware setup. (3) Diagnose: We introduce an automated error analysis tool that analyzes inconsistencies between the judge and agents, uncovering common errors and providing actionable feedback for agent improvement. We show that our TED framework reveals new insights regarding agent performance across models and user expertise levels. We also demonstrate potential gains in agent performance, with peaks of 8-10% on our proposed metrics, after incorporating the identified error remedies into the agent's design.