Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models
AI Summary
Proposes Proxy-GRM, which improves rubric quality in vision-language reward models through proxy-guided rubric verification.
Main Contributions
- Proposes the Proxy-GRM framework, which explicitly optimizes the reward model's intermediate rubrics.
- Introduces lightweight proxy agents that predict the preference ordering, using prediction accuracy as a reward for rubric quality.
- Experiments show Proxy-GRM reaches state-of-the-art results on multiple benchmarks, and the learned rubrics are transferable.
Methodology
Lightweight proxy agents (Proxy-SFT and Proxy-RL) predict the preference ordering using the candidate rubric as evidence; the proxy's prediction accuracy serves as a rubric-quality reward that guides the model to generate better rubrics.
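The reward described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `"A>B"`-style preference encoding are hypothetical, and the actual proxy is a trained model rather than a string comparison.

```python
def rubric_quality_reward(proxy_prediction: str, true_preference: str) -> float:
    """Hypothetical per-sample reward: 1.0 if the proxy, given only the
    candidate rubric (plus the query and response pair), recovers the
    ground-truth preference ordering; 0.0 otherwise."""
    return 1.0 if proxy_prediction == true_preference else 0.0


def batch_rubric_reward(predictions: list[str], labels: list[str]) -> float:
    """Average proxy accuracy over a batch of rubrics, usable as the
    RL reward signal that incentivizes consistent, transferable rubrics."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```

In training, this scalar would be combined with the GRM's standard verdict reward, so the policy is pushed to produce rubrics that a separate, lightweight verifier can act on.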
Original Abstract
Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy's prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming methods trained on four times as much data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.