EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL
AI Summary
Proposes EMA-PG, an algorithm that improves policy-gradient reinforcement learning for LLMs through an EMA-anchored policy and a Top-k KL estimator.
Key Contributions
- Introduces an EMA anchor policy that improves RL stability
- Proposes a Top-k KL estimator that balances bias and variance
- Experiments show that EMA-PG significantly improves LLM performance on reasoning and agentic tasks
Methodology
Replaces the fixed anchor policy with an Exponential Moving Average (EMA), introduces a Top-k KL estimator, and combines both with GRPO for reinforcement learning.
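The EMA anchor update can be sketched in a few lines. This is a minimal illustration in plain Python (the function name `ema_update` and the decay value are assumptions for illustration, not the paper's exact hyperparameters); in practice the update would run over model parameter tensors after each policy update step:

```python
def ema_update(anchor, policy, decay=0.999):
    """Illustrative EMA anchor update (decay value is an assumption):
    anchor <- decay * anchor + (1 - decay) * policy.
    The anchor slowly tracks the learned policy, analogous to a
    target network in deep Q-learning."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(anchor, policy)]

# Example: after one update the anchor moves slightly toward the policy.
anchor = [1.0, 0.0]
policy = [0.0, 1.0]
anchor = ema_update(anchor, policy, decay=0.9)  # [0.9, 0.1]
```

A larger decay keeps the anchor closer to a fixed reference (stable but slow to adapt), while a smaller decay lets it track the current policy more closely; the paper derives the stability conditions governing this trade-off.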
Original Abstract
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce a Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using the EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO. On agentic RL domains, with Qwen-3B base, EMA-PG improves GRPO by an average of 33.3% across 7 datasets of Q&A with search engines, including 29.7% $\rightarrow$ 44.1% on HotpotQA and 27.4% $\rightarrow$ 40.1% on 2WikiMultiHopQA. Overall, we show that EMA-PG is a simple, principled, and powerful approach to scaling RL for LLMs. Code: https://github.com/LunjunZhang/ema-pg
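To make the "interpolation between exact KL and sampled KL" concrete, here is one hedged sketch of how a Top-k KL estimator could work (the exact construction below is an assumption for illustration, not necessarily the paper's formula): compute the exact KL contribution of the k most probable tokens under the policy, and estimate the remaining tail with a single sample drawn from the renormalized complement. At k equal to the vocabulary size this reduces to the exact KL, and at k = 0 it reduces to a single-sample estimate, while remaining unbiased in expectation at any k:

```python
import math
import random

def topk_kl(p, q, k, rng=random):
    """Illustrative Top-k KL estimator for KL(p || q) over a discrete
    vocabulary (sketch under stated assumptions, not the paper's exact
    estimator). Exact sum over the top-k tokens of p, plus a one-sample
    unbiased estimate of the tail contribution."""
    idx = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top, rest = idx[:k], idx[k:]
    # Exact contribution of the k most probable tokens under p.
    exact = sum(p[i] * math.log(p[i] / q[i]) for i in top)
    tail_mass = sum(p[i] for i in rest)
    if not rest or tail_mass == 0.0:
        return exact
    # Sample one token from p restricted to the complement, renormalized.
    weights = [p[i] / tail_mass for i in rest]
    j = rng.choices(rest, weights=weights, k=1)[0]
    # E[tail_mass * log(p_j/q_j)] equals the exact tail term, so the
    # overall estimate is unbiased for the KL value.
    return exact + tail_mass * math.log(p[j] / q[j])
```

The design intuition is that most of the KL mass concentrates on a few high-probability tokens, so covering them exactly removes most of the variance of a pure sampled estimator while keeping the cost far below a full-vocabulary sum.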