Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models
AI Summary
Proposes Proxy-GRM, which improves rubric quality in vision-language reward models through proxy-guided rubric verification.
Main Contributions
- Proposes the Proxy-GRM framework, which explicitly optimizes the reward model's intermediate rubrics.
- Introduces lightweight proxy agents that predict the preference ordering, using prediction accuracy as a reward for rubric quality.
- Experiments show Proxy-GRM reaches state-of-the-art results on multiple benchmarks, and the learned rubrics are transferable.
Methodology
Lightweight proxy agents (Proxy-SFT and Proxy-RL) predict the preference ordering using the candidate rubric as evidence; the proxy's prediction accuracy serves as a rubric-quality reward that guides the model to generate better rubrics.
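The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `"A>B"`-style preference encoding are hypothetical, and the actual proxy is a trained model rather than a string comparison.

```python
def rubric_quality_reward(proxy_prediction: str, true_preference: str) -> float:
    """Hypothetical per-sample reward: 1.0 if the proxy, given only the
    candidate rubric (plus the query and response pair), recovers the
    ground-truth preference ordering; 0.0 otherwise."""
    return 1.0 if proxy_prediction == true_preference else 0.0


def batch_rubric_reward(predictions: list[str], labels: list[str]) -> float:
    """Average proxy accuracy over a batch of rubrics, usable as the
    RL reward signal that incentivizes consistent, transferable rubrics."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In training, this scalar would be combined with the GRM's standard verdict reward, so the policy is pushed to produce rubrics that a separate, lightweight verifier can act on.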
Original Abstract
Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming methods trained on four times as much data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.