TRE: Encouraging Exploration in the Trust Region
AI Summary
The paper proposes Trust Region Entropy (TRE), a method that improves the exploration ability of LLMs in reinforcement learning.
Main Contributions
- Identifies cumulative tail risk as the reason standard entropy regularization fails for LLMs
- Proposes TRE, which encourages exploration within the model's trust region
- Demonstrates experimentally that TRE outperforms baseline methods on mathematical reasoning, combinatorial search, and preference alignment tasks
Methodology
TRE restricts exploration to the model's trust region, preventing probability mass from being diluted onto invalid tokens. Training and evaluation use the PPO algorithm.
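The idea of computing entropy only over the trusted part of the distribution can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the "trust region" is the top-p nucleus of the policy distribution (the smallest set of tokens whose cumulative probability exceeds a threshold), which is one plausible construction the summary does not specify.

```python
import math
import torch

def trust_region_entropy(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Entropy computed only over a hypothetical 'trust region': the
    smallest set of tokens whose cumulative probability exceeds top_p.
    Tokens in the long tail are excluded, so the bonus never rewards
    spreading mass onto implausible tokens."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, dim=-1, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens until cumulative mass first exceeds top_p (top-1 always kept).
    keep = (cumsum - sorted_probs) < top_p
    kept = sorted_probs * keep
    # Renormalize within the trust region, then take Shannon entropy.
    kept = kept / kept.sum(dim=-1, keepdim=True)
    return -(kept * torch.log(kept.clamp_min(1e-12))).sum(dim=-1)

# Two plausible next tokens, huge invalid tail: entropy is ln(2), not diluted.
logits = torch.tensor([[0.0, 0.0] + [-100.0] * 1000])
print(trust_region_entropy(logits).item())  # ≈ ln 2 ≈ 0.693
```

In an RL loop, such a bonus would be added to the PPO objective in place of the standard (full-vocabulary) entropy term, so that exploration pressure stays within plausible candidates.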
Original Abstract
Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.