Agent Tuning & Optimization · Relevance: 8/10

Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
arXiv: 2602.09782v1 · Published: 2026-02-10 · Updated: 2026-02-10

AI Summary

Achieves precise control of LLM policy entropy in reinforcement learning via dynamic gradient clipping, effectively mitigating the entropy-collapse problem.

Key Contributions

  • Proposes an entropy-control perspective based on gradient-preserving clipping
  • Verifies, theoretically and empirically, how different importance-sampling-ratio regions drive entropy change
  • Designs dynamic clipping thresholds and dynamic entropy-control strategies
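The quantity being controlled throughout is the token-level policy entropy. A minimal, self-contained sketch (the function name and shapes are illustrative, not the paper's API):

```python
import numpy as np

def policy_entropy(logits):
    """Token-level policy entropy H(pi) = -sum_a pi(a) * log pi(a),
    the quantity whose rapid decay ("entropy collapse") the paper targets.
    Computed over the last axis with a numerically stable softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stable softmax shift
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p)).sum(axis=-1)
```

For a uniform distribution over V actions this returns log(V), the maximum; as the policy becomes overconfident it approaches 0.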

Methodology

Theoretically analyzes how gradient clipping affects entropy, designs dynamic clipping thresholds to control entropy precisely, and experimentally validates several dynamic entropy-control strategies.
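To make the clipping mechanics concrete, here is a hedged sketch of the per-token gradient under PPO-style clipping. Standard PPO-Clip zeroes the gradient once the clipped branch of the surrogate is active; a common gradient-preserving variant (via the stop-gradient identity clip_sg(r) = sg(clip(r) - r) + r) keeps the raw ratio's gradient instead. The function names and the `preserve` flag are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Standard PPO-Clip surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)

def grad_wrt_ratio(ratio, adv, eps_low=0.2, eps_high=0.2, preserve=False):
    """Manual d(surrogate)/d(ratio), exposing the difference the paper
    builds on: standard clipping gives zero gradient on tokens where the
    clipped branch wins; the gradient-preserving variant (assumption:
    stop-gradient trick) keeps d(r*A)/dr = A everywhere."""
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    clipped_wins = clipped * adv < ratio * adv
    if preserve:
        return adv * np.ones_like(ratio, dtype=float)  # gradient of raw ratio
    return np.where(clipped_wins, 0.0, adv)            # zeroed when clipped
```

With `eps_low`/`eps_high` made dynamic rather than fixed, specific importance-ratio regions can be selectively allowed to contribute gradient, which is the lever the paper uses to steer entropy up or down.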

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using a dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
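The three named strategies can be read as target-entropy schedules over training. The sketch below shows plausible shapes for two of them; the exact curves and parameter values are assumptions, not taken from the paper:

```python
import math

def increase_then_decrease(step, total, h0=1.0, h_peak=1.5, h_end=0.5):
    """Increase-then-decrease schedule (assumed shape): ramp the entropy
    target up for the first half of training, then anneal it down."""
    half = total / 2
    if step < half:
        return h0 + (h_peak - h0) * step / half
    return h_peak + (h_end - h_peak) * (step - half) / half

def oscillatory_decay(step, total, h0=1.0, h_end=0.5, cycles=4, amp=0.2):
    """Oscillatory-decay schedule (assumed shape): a linearly decaying
    entropy target with a damped sinusoidal oscillation on top."""
    base = h0 + (h_end - h0) * step / total
    damp = 1.0 - step / total                       # oscillation dies out
    return base + amp * damp * math.sin(2 * math.pi * cycles * step / total)
```

In the paper's framework, such a schedule would drive the dynamic clipping threshold: when measured entropy falls below the target, clipping is relaxed on entropy-increasing ratio regions, and vice versa.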

Tags

Reinforcement Learning · LLM · Entropy Control · Gradient Clipping · Policy Optimization

arXiv Categories

cs.LG cs.AI cs.CL