Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
AI Summary
Achieves precise control of LLM policy entropy in reinforcement learning via dynamic gradient clipping, effectively mitigating the entropy-collapse problem.
Main Contributions
- Proposes an entropy-control perspective based on gradient-preserving clipping
- Theoretically and empirically verifies how different importance sampling ratio regions contribute to entropy change
- Designs dynamic clipping thresholds and dynamic entropy-control strategies
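The second contribution, analyzing which importance sampling ratio regions drive entropy up or down, can be illustrated with a small diagnostic. This is a hypothetical sketch: the region boundaries below simply reuse PPO-style clip bounds, and the per-region mean advantage is only a rough proxy for how each region pushes the policy; the paper's exact decomposition is not reproduced here.

```python
import numpy as np

def ratio_region_stats(ratios, advantages, eps=0.2):
    """Bin tokens by importance ratio r = pi_new / pi_old and report, per
    region, the token count and mean advantage -- a rough proxy for how
    each region pushes the policy update (and hence its entropy).

    Regions (boundaries are illustrative, following PPO's clip range):
      below : r < 1 - eps      (tokens the new policy down-weights)
      inside: 1-eps <= r <= 1+eps
      above : r > 1 + eps      (tokens the new policy up-weights)
    """
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    regions = {
        "below": ratios < 1 - eps,
        "inside": (ratios >= 1 - eps) & (ratios <= 1 + eps),
        "above": ratios > 1 + eps,
    }
    stats = {}
    for name, mask in regions.items():
        n = int(mask.sum())
        stats[name] = {
            "count": n,
            "mean_adv": float(advantages[mask].mean()) if n else 0.0,
        }
    return stats

# Demo on synthetic ratios (log-normal, roughly centered at r = 1)
rng = np.random.default_rng(0)
r = rng.lognormal(mean=0.0, sigma=0.3, size=1000)
adv = rng.normal(size=1000)
print(ratio_region_stats(r, adv))
```

Tracking such per-region statistics over training is one way to verify empirically which ratio regions correlate with entropy growth versus reduction.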
Methodology
Theoretically analyzes the effect of gradient clipping on entropy, designs dynamic clipping thresholds to precisely control entropy, and experimentally validates a range of dynamic entropy-control strategies.
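The core idea of coupling the clipping threshold to an entropy target can be sketched as a simple feedback rule. This is an illustration, not the paper's actual update: the proportional gain, bounds, and update form below are invented for the sketch; the paper derives its thresholds from the ratio-region analysis.

```python
import numpy as np

def update_clip_threshold(eps, entropy, entropy_target,
                          k=0.05, eps_min=0.1, eps_max=0.4):
    """Hypothetical proportional controller for the clip range eps.

    When measured policy entropy falls below the target, widen the clip
    range (letting entropy-increasing tokens contribute larger updates);
    when entropy is above target, narrow it. Gain `k` and the bounds
    [eps_min, eps_max] are illustrative choices.
    """
    eps_new = eps + k * (entropy_target - entropy)
    return float(np.clip(eps_new, eps_min, eps_max))
```

Run once per training step with the current measured entropy, this yields a threshold that tracks a prescribed entropy schedule instead of a fixed clip range.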
Original Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
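The entropy-schedule strategies named in the abstract (increase-then-decrease, oscillatory decay) can be sketched as target-entropy curves over training. The shapes below are illustrative: the bounds `h_lo`/`h_hi`, the peak fraction, and the cycle count are assumptions, not the paper's settings; the decrease-increase-decrease schedule would be built analogously from piecewise segments.

```python
import math

def increase_then_decrease(t, T, h_lo=1.0, h_hi=2.0, peak=0.3):
    """Target entropy rises linearly from h_lo to h_hi over the first
    `peak` fraction of training, then decays linearly back to h_lo."""
    x = t / T
    if x <= peak:
        return h_lo + (h_hi - h_lo) * x / peak
    return h_hi - (h_hi - h_lo) * (x - peak) / (1 - peak)

def oscillatory_decay(t, T, h_lo=1.0, h_hi=2.0, cycles=4):
    """Target entropy oscillates while its envelope shrinks linearly,
    converging to h_lo at the end of training."""
    x = t / T
    envelope = (h_hi - h_lo) * (1 - x)
    return h_lo + envelope * 0.5 * (1 + math.cos(2 * math.pi * cycles * x))
```

Feeding such a schedule to an entropy-tracking mechanism (e.g., a dynamic clipping threshold) lets training first explore at high entropy and then anneal toward a confident policy.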