AI Agents relevance: 9/10

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang
arXiv: 2603.09821v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

One-Eval is an automated LLM evaluation system that uses agent technology to provide traceable, customizable evaluation workflows.

Key Contributions

  • Proposes One-Eval, an agentic evaluation system that streamlines the LLM evaluation workflow
  • Integrates the NL2Bench, BenchResolve, and Metrics & Reporting modules for end-to-end evaluation
  • Supports human-in-the-loop collaboration with debugging and auditing capabilities

Methodology

Uses agent technology to convert natural-language evaluation requests into executable workflows, automating dataset acquisition, schema normalization, and metric selection.
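The pipeline above can be sketched as a toy request-to-report flow. This is an illustrative sketch only, not One-Eval's actual API: the names `nl2plan`, `normalize`, `run_eval`, and the keyword-matching benchmark table are all hypothetical stand-ins for the NL2Bench, BenchResolve, and Metrics & Reporting stages, and the per-sample `trace` list mimics the evidence trail kept for auditability.

```python
from dataclasses import dataclass, field

# Hypothetical benchmark registry: name -> default metric.
KNOWN_BENCHMARKS = {"gsm8k": "exact_match", "mmlu": "accuracy"}

@dataclass
class EvalPlan:
    """Structured evaluation intent extracted from a natural-language request."""
    benchmark: str
    metric: str
    trace: list = field(default_factory=list)  # per-step evidence trail

def nl2plan(request: str) -> EvalPlan:
    """Toy NL2Bench stage: match a benchmark keyword, pick its default metric."""
    text = request.lower()
    for name, metric in KNOWN_BENCHMARKS.items():
        if name in text:
            plan = EvalPlan(benchmark=name, metric=metric)
            plan.trace.append(f"matched benchmark '{name}' in request")
            return plan
    raise ValueError("no known benchmark mentioned in request")

def normalize(sample: dict) -> dict:
    """Toy BenchResolve stage: map heterogeneous field names onto one schema."""
    return {
        "question": sample.get("question") or sample.get("prompt"),
        "answer": sample.get("answer") or sample.get("target"),
    }

def run_eval(plan: EvalPlan, samples: list, predict) -> dict:
    """Toy Metrics & Reporting stage: score predictions, log sample evidence."""
    correct = 0
    for s in map(normalize, samples):
        pred = predict(s["question"])
        hit = pred.strip() == s["answer"].strip()
        correct += hit
        plan.trace.append(f"q={s['question']!r} pred={pred!r} correct={hit}")
    return {"benchmark": plan.benchmark, "metric": plan.metric,
            "score": correct / len(samples)}
```

For example, `run_eval(nl2plan("Evaluate my model on GSM8K"), samples, model_fn)` would return a score dict while `plan.trace` retains one line of evidence per sample for later review.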

Original Abstract

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.

Tags

LLM Evaluation · Agent · Automation · Reproducibility

arXiv Categories

cs.CL