A Rubric-Supervised Critic from Sparse Real-World Outcomes
AI Summary
Proposes a rubric-based supervision framework that learns a critic model for coding agents from sparse real-world data, improving performance on code-generation tasks.
Key Contributions
- Proposes the Critic Rubrics framework, which learns a critic model from behavioral features and sparse feedback
- Shows that the critic can be used for reranking, early stopping, and data curation
- Validates the method's effectiveness on the SWE-bench benchmark
Methodology
A critic model is trained with a semi-supervised objective that jointly predicts rubric scores and sparse human feedback; the trained critic is then used for reward shaping and trajectory selection.
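The semi-supervised objective above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`bce`, `critic_loss`), the weighting scheme, and the choice of binary cross-entropy are all assumptions; the key idea shown is that the rubric term is always supervised from trace-derived labels, while the outcome term is masked out whenever sparse human feedback is absent.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def critic_loss(rubric_preds, rubric_labels, outcome_pred, outcome_label, lam=0.5):
    """Hypothetical semi-supervised objective: rubric supervision is always
    available (derived from interaction traces); the outcome term is added
    only when sparse human feedback is present (outcome_label is not None)."""
    rubric_term = sum(bce(p, y) for p, y in zip(rubric_preds, rubric_labels)) / len(rubric_preds)
    outcome_term = bce(outcome_pred, outcome_label) if outcome_label is not None else 0.0
    return rubric_term + lam * outcome_term
```

With `lam` controlling how strongly the (rare) outcome labels influence training, unlabeled trajectories still contribute gradient signal through the rubric term alone.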
原文摘要
Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (+15.9 Best@8 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
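The two inference-time uses of the critic mentioned in the abstract, best-of-N reranking and early stopping, can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the fixed-threshold stopping rule, and the `n_max`/`threshold` parameters are hypothetical; the paper reports an 83% reduction in attempts, which such a scheme aims for by halting as soon as the critic is confident.

```python
def rerank_best_of_n(trajectories, critic_score):
    """Best-of-N reranking: return the trajectory the critic scores highest."""
    return max(trajectories, key=critic_score)

def early_stop(generate, critic_score, n_max=8, threshold=0.8):
    """Early stopping (hypothetical rule): sample attempts sequentially and
    halt once the critic's score clears a confidence threshold, instead of
    always generating all n_max attempts."""
    best, best_score = None, float("-inf")
    for _ in range(n_max):
        trajectory = generate()
        score = critic_score(trajectory)
        if score > best_score:
            best, best_score = trajectory, score
        if score >= threshold:
            break  # confident enough; skip the remaining attempts
    return best
```

Reranking spends the full N-attempt budget and picks afterward; early stopping trades a small amount of selection quality for far fewer generated attempts.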