Agentified Assessment of Logical Reasoning Agents
AI Summary
Proposes an agent-based framework for evaluating logical reasoning agents and benchmarks an auto-formalization agent under it.
Key Contributions
- A reproducible, auditable, and robust agent-based evaluation framework
- An assessor agent that issues tasks, monitors execution, and records errors
- A benchmark of an auto-formalization agent on the FOLIO dataset
Methodology
An assessor agent evaluates the reasoning agent through a standardized agent-to-agent interface; the assessor is responsible for task management, output parsing, and error recording.
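The assessor loop described above can be sketched in plain Python. All names here (`assess`, `parse_verdict`, the failure labels) are illustrative assumptions, not the paper's actual API; the sketch only shows the protocol shape: issue a task, enforce a time budget, parse the answer, and record a structured failure type when parsing or execution fails.

```python
from dataclasses import dataclass

@dataclass
class Record:
    task_id: str
    verdict: str      # "correct" | "wrong" | a failure type such as "timeout"
    raw_output: str

def parse_verdict(text):
    # FOLIO-style tasks expect one of three entailment labels.
    for label in ("Uncertain", "True", "False"):
        if label.lower() in text.lower():
            return label
    return None  # unparsable output

def assess(tasks, run_agent, timeout_s=60):
    """Issue each task to the agent under test and log a structured record.

    `run_agent(prompt, timeout)` is a hypothetical callable wrapping the
    agent-to-agent interface; it may raise TimeoutError when the budget
    is exceeded.
    """
    records = []
    for task in tasks:
        try:
            out = run_agent(task["prompt"], timeout=timeout_s)
        except TimeoutError:
            records.append(Record(task["id"], "timeout", ""))
            continue
        label = parse_verdict(out)
        if label is None:
            records.append(Record(task["id"], "unparsable_output", out))
        elif label == task["gold"]:
            records.append(Record(task["id"], "correct", out))
        else:
            records.append(Record(task["id"], "wrong", out))
    return records
```

Because every outcome, including timeouts and unparsable outputs, becomes a typed record, the resulting log is auditable and the run is reproducible given the same tasks and agent.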
Original Abstract
We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought baseline (73.89%).