One-Eval: An Agentic System for Automated and Traceable LLM Evaluation
AI Summary
One-Eval is an automated LLM evaluation system that uses agent technology to deliver traceable, customizable evaluation workflows.
Key Contributions
- Proposes One-Eval, an agentic evaluation system that streamlines LLM evaluation workflows
- Integrates the NL2Bench, BenchResolve, and Metrics & Reporting modules into an end-to-end evaluation pipeline
- Supports human-in-the-loop collaboration, with debugging and auditability capabilities
Methodology
Agent technology converts natural-language evaluation requests into executable workflows, automating dataset acquisition, schema normalization, and metric selection.
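The three-stage flow described above can be sketched as a minimal pipeline. This is an illustrative mock, not One-Eval's actual API: all class names, function names, and the keyword-based task detection are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class EvalPlan:
    """Hypothetical container for an evaluation plan and its evidence trail."""
    task: str
    benchmarks: list = field(default_factory=list)
    metrics: list = field(default_factory=list)
    trace: list = field(default_factory=list)  # per-step trail for auditability

def nl2bench(request: str) -> EvalPlan:
    # Stage 1 (assumed logic): structure the natural-language request
    # into a task type and a personalized benchmark plan.
    task = "qa" if "question" in request.lower() else "general"
    plan = EvalPlan(task=task, benchmarks=["example_bench"])
    plan.trace.append(f"nl2bench: parsed request -> task={task}")
    return plan

def bench_resolve(plan: EvalPlan) -> EvalPlan:
    # Stage 2 (assumed logic): acquire the dataset and normalize its
    # schema so the downstream evaluation is actually executable.
    plan.trace.append("bench_resolve: dataset acquired, schema normalized")
    return plan

def metrics_and_reporting(plan: EvalPlan) -> EvalPlan:
    # Stage 3 (assumed logic): pick task-aware metrics and record the
    # choice in the trace rather than emitting only a scalar score.
    plan.metrics = ["exact_match"] if plan.task == "qa" else ["accuracy"]
    plan.trace.append(f"metrics: selected {plan.metrics}")
    return plan

def run_pipeline(request: str) -> EvalPlan:
    """Chain the three stages end to end."""
    return metrics_and_reporting(bench_resolve(nl2bench(request)))

plan = run_pipeline("Evaluate my model on question answering")
```

Because every stage appends to `trace`, a reviewer can inspect or roll back any intermediate decision, mirroring the human-in-the-loop checkpoints the system provides.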
Original Abstract
Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.