Agent Tuning & Optimization · Relevance: 9/10

Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu
arXiv: 2603.10535v1 · Published: 2026-03-11 · Updated: 2026-03-11

AI Summary

The paper proposes GR$^3$, a method that effectively mitigates length inflation in reinforcement learning while preserving downstream performance.

Main Contributions

  • Proposes the Group Relative Reward Rescaling (GR$^3$) framework
  • Introduces group-relative regularization and advantage-aware calibration
  • Shows experimentally that GR$^3$ outperforms existing length-regularization methods

Methodology

GR$^3$ reframes length control as a multiplicative reward rescaling, combining group-relative regularization with advantage-aware calibration to dynamically adapt length budgets to instance difficulty.
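The summary does not give GR$^3$'s actual formulas, so the sketch below is only a hypothetical illustration of the three ingredients named above: a multiplicative gate on rewards, a group-relative length budget, and an advantage-aware calibration that spares the best trajectory. The function name `gr3_rescale`, the reciprocal gate form, the use of the group mean as the budget, and the hyperparameter `alpha` are all assumptions, not the paper's method.

```python
import numpy as np

def gr3_rescale(rewards, lengths, alpha=0.5):
    """Hypothetical sketch of group-relative multiplicative length rescaling.

    rewards: raw rewards for one GRPO-style group of rollouts.
    lengths: token counts of the corresponding rollouts.
    alpha:   assumed hyperparameter controlling the gate's strength.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative budget: normalize lengths within the group, so a prompt
    # whose group responses are long overall (a "hard" instance) gets a
    # correspondingly larger budget instead of a fixed global one.
    budget = lengths.mean()
    excess = np.clip(lengths / budget - 1.0, 0.0, None)

    # Multiplicative gating: shrink rewards of over-budget rollouts instead of
    # subtracting an additive penalty, which could be compensated by reward gains.
    scale = 1.0 / (1.0 + alpha * excess)

    # Advantage-aware calibration (assumed rule): leave the best rollout's
    # reward untouched so high-quality trajectories keep their advantage signal.
    scale[np.argmax(rewards)] = 1.0
    return rewards * scale
```

With rewards `[1.0, 1.0, 0.5]` and lengths `[100, 200, 100]`, only the over-budget second rollout is shrunk; the group winner and the under-budget rollouts are untouched.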

Original Abstract

Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
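The compensatory effect the abstract attributes to additive penalties can be shown with toy numbers: if a verbose rollout's raw reward gain exceeds the subtracted penalty, verbosity still wins, whereas a multiplicative gate forces the extra tokens to earn proportionally more reward. The penalty weight, gate form, and all numbers below are hypothetical, chosen only to make the contrast visible.

```python
lam = 0.001  # assumed additive penalty weight
concise = {"reward": 1.0, "length": 100}
verbose = {"reward": 1.2, "length": 250}

# Additive penalty: reward - lam * length. The verbose rollout's +0.2 reward
# gain outweighs its extra 0.15 penalty, so the shortcut pays off.
additive = lambda t: t["reward"] - lam * t["length"]
assert additive(verbose) > additive(concise)  # verbosity wins

# Multiplicative gate (assumed form): divide by a factor growing with the
# tokens beyond a budget, so reward shrinks proportionally, not by a constant.
gate = lambda t, budget=120: t["reward"] / (1.0 + 0.01 * max(0, t["length"] - budget))
assert gate(verbose) < gate(concise)  # verbosity no longer pays
```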

Tags

Reinforcement Learning · Length Inflation · Reward Reshaping · LLM

arXiv Categories

cs.LG cs.CL