Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning
AI Summary
The paper proposes GR$^3$, a method that effectively mitigates length inflation in reinforcement learning while preserving performance.
Key Contributions
- Proposes the Group Relative Reward Rescaling (GR$^3$) framework
- Introduces group-relative regularization and advantage-aware calibration
- Demonstrates experimentally that GR$^3$ outperforms existing length-regularization methods
Methodology
GR$^3$ reframes length control as multiplicative reward rescaling, combining group-relative regularization with advantage-aware calibration to dynamically adapt length budgets to instance difficulty.
Original Abstract
Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
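To make the multiplicative-rescaling idea concrete, here is a minimal, hypothetical sketch of how a group-relative, advantage-aware length rescaling could look for one group of sampled trajectories. The paper's exact formulas are not given in this abstract, so the budget rule (reward-weighted group mean length), the scaling function, and the parameter `alpha` below are all illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def group_relative_length_rescale(rewards, lengths, alpha=0.5):
    """Hypothetical sketch: multiplicative, group-relative length rescaling.

    - The length budget is derived per group from the trajectories with
      non-negative advantage, so harder prompts (where even good answers
      are long) naturally receive a larger budget.
    - Rewards are shrunk multiplicatively, and continuously, as length
      exceeds the budget (no additive penalty, no hard binary gate).
    - Only positive-advantage trajectories are rescaled, preserving the
      advantage signal of high-quality responses.
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative advantage (GRPO-style: reward minus group mean).
    adv = rewards - rewards.mean()

    # Length budget from the above-average part of the group (assumption).
    mask = adv >= 0
    budget = lengths[mask].mean() if mask.any() else lengths.mean()

    # Continuous multiplicative factor in (0, 1]; equals 1 at/below budget.
    excess = np.maximum(lengths - budget, 0.0) / max(budget, 1.0)
    scale = 1.0 / (1.0 + alpha * excess)

    # Advantage-aware: leave below-average trajectories untouched.
    return np.where(adv > 0, rewards * scale, rewards)

# Example group: two correct answers (one concise, one verbose), two wrong.
out = group_relative_length_rescale([1.0, 1.0, 0.0, 0.0],
                                    [100, 300, 200, 200])
# The verbose correct answer is discounted relative to the concise one.
```

Here the budget is 200 tokens (the mean length of the two correct answers), so the 300-token answer ends up with a smaller rescaled reward than the 100-token one, while the incorrect answers' rewards are unchanged.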