Agent Tuning & Optimization (Relevance: 9/10)

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
arXiv: 2602.17616v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

Targets the problem of excessively high gradient variance in asynchronous RL training of LLMs, proposing the Variance-Controlled Policy Optimization (VCPO) algorithm.

Key Contributions

  • Diagnoses the problem of training collapse caused by high-variance gradients in asynchronous RL, linking it to effective sample size (ESS) and gradient norms.
  • Proposes the VCPO algorithm, which controls variance by scaling the learning rate based on ESS and applying a minimum-variance baseline.
  • Experiments show that VCPO substantially improves the robustness of asynchronous training and outperforms existing methods on math, reasoning, and tool-use tasks.
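The collapse diagnostic above hinges on ESS computed from importance ratios. A minimal sketch, assuming the standard definition ESS = (Σw)²/Σw² over the importance weights (the paper's exact formulation may differ):

```python
import numpy as np

def effective_sample_size(log_ratios):
    """Standard ESS of importance weights: (sum w)^2 / sum w^2, in [1, n].
    Subtracting the max log-ratio before exponentiating avoids overflow
    and leaves the ratio-of-sums unchanged."""
    w = np.exp(log_ratios - np.max(log_ratios))
    return (w.sum() ** 2) / (w ** 2).sum()

# Near-uniform ratios keep ESS close to n; a heavy-tailed batch,
# where one stale rollout dominates, collapses ESS toward 1.
ess_uniform = effective_sample_size(np.zeros(8))               # -> 8.0
ess_skewed = effective_sample_size(np.array([5.0] + [0.0] * 7))  # close to 1
```

Monitoring this quantity per batch is what lets a drop in ESS act as an early-warning signal before gradients blow up.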

Methodology

Proposes the VCPO algorithm, which bounds update magnitudes by scaling the learning rate with ESS and reduces variance via a closed-form minimum-variance baseline.
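The two ingredients can be sketched together. This is a hypothetical reconstruction, not the authors' code: the ESS-based learning-rate scale is assumed to be the simple ratio ess/n, and the baseline uses one common closed-form for a scalar importance-weighted REINFORCE estimator, b* = E[w²R]/E[w²]; the paper's exact expressions may differ.

```python
import numpy as np

def vcpo_step_sketch(rewards, log_ratios, base_lr):
    """Hypothetical sketch of VCPO's two stabilizers (illustrative only)."""
    n = len(rewards)
    w = np.exp(log_ratios)  # importance ratios pi_current / pi_stale

    # (i) ESS-scaled learning rate: when a few samples dominate the batch,
    # ESS drops and the effective step size shrinks proportionally.
    ess = (w.sum() ** 2) / (w ** 2).sum()
    lr = base_lr * ess / n  # in (0, base_lr]

    # (ii) Closed-form minimum-variance baseline (assumed scalar form
    # b* = E[w^2 R] / E[w^2]); no auxiliary value model is needed.
    b = (w ** 2 * rewards).sum() / (w ** 2).sum()
    advantages = w * (rewards - b)
    return lr, advantages

# On-policy batch (all log-ratios zero): ESS = n, so lr == base_lr,
# and the baseline reduces to the mean reward.
lr, adv = vcpo_step_sketch(np.array([1.0, 0.0, 1.0, 0.0]), np.zeros(4), 1e-4)
```

Compared with masking or clipping stabilizers, both pieces act on the estimator's variance directly rather than discarding samples, which is the distinction the abstract emphasizes.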

Original Abstract

Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly $\textbf{higher variance}$: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose $\textbf{V}$ariance $\textbf{C}$ontrolled $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{VCPO}$), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5$\times$ while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.

Tags

LLM Reinforcement Learning Asynchronous Training Variance Reduction

arXiv Categories

cs.LG cs.AI