AI Agents relevance: 9/10

Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra
arXiv: 2602.16246v1 Published: 2026-02-18 Updated: 2026-02-18

AI Summary

Proposes a scalable verifiable-reward framework based on proxy state evaluation for assessing multi-turn tool-calling LLM agents.

Key Contributions

  • Proposes a proxy state-based evaluation framework for LLM agents.
  • The framework uses LLMs for state tracking and goal-completion verification, removing the need for a deterministic backend.
  • Experiments show the framework produces stable, model-differentiating rankings and can be used to generate high-quality training data.

Methodology

An LLM state tracker infers a proxy agent state from the full interaction trace, and LLM judges verify goal completion against the scenario's constraints.
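The evaluation loop described above can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the `Scenario` fields, `state_tracker`, and `judge` names are assumptions, and the two callables stand in for actual LLM calls (here they can be any function with the matching signature).

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scenario:
    """Hypothetical scenario spec: goal, expected final state, constraints."""
    user_goal: str
    expected_final_state: dict            # e.g. {"order_status": "cancelled"}
    constraints: list = field(default_factory=list)

def evaluate_trace(
    scenario: Scenario,
    trace: list,                                   # full multi-turn interaction trace
    state_tracker: Callable[[list], dict],         # LLM stand-in: trace -> proxy state
    judge: Callable[[dict, Scenario], bool],       # LLM stand-in: verify goal completion
) -> dict:
    """Infer a structured proxy state from the trace, then judge it."""
    proxy_state = state_tracker(trace)
    goal_met = judge(proxy_state, scenario)
    # Crude stand-in for hallucination detection: flag tool calls whose
    # arguments were not grounded in scenario-provided facts.
    hallucinated = [
        step for step in trace
        if step.get("type") == "tool_call" and step.get("grounded") is False
    ]
    return {
        "proxy_state": proxy_state,
        "goal_completed": goal_met,
        "hallucination_count": len(hallucinated),
    }
```

In use, `state_tracker` and `judge` would wrap prompted LLM calls; the key design point from the abstract is that correctness is checked against the inferred proxy state rather than a deterministic backend database.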

Original Abstract

Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

Tags

LLM Agents · Evaluation · Tool Calling · Scalability

arXiv Categories

cs.AI