SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
AI Summary
SortedRL accelerates RL training for LLMs through online length-aware scheduling, improving rollout efficiency while maintaining training stability.
Main Contributions
- Proposes SortedRL, an online length-aware scheduling strategy that improves RL training efficiency.
- Designs a cache-based mechanism to control the degree of off-policy training (a minimal sketch follows this list).
- Builds dedicated RL infrastructure to manage rollouts and policy updates.
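The cache-based control admits a simple reading: each rollout is tagged with the policy version that generated it, and samples older than a fixed staleness bound are evicted before they reach an update batch. Below is a minimal sketch of that idea; the identifiers (`RolloutCache`, `max_staleness`, `take_fresh`) are illustrative assumptions, not names from the paper.

```python
from collections import deque

class RolloutCache:
    """Tags each rollout with the policy version that produced it and
    evicts anything older than `max_staleness` versions (hypothetical
    sketch, not the authors' implementation)."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._cache = deque()  # (policy_version, sample), oldest first

    def put(self, sample, policy_version: int):
        self._cache.append((policy_version, sample))

    def take_fresh(self, current_version: int):
        # Drop rollouts whose policy lag exceeds the bound; everything
        # returned is at most `max_staleness` versions off-policy.
        while self._cache and current_version - self._cache[0][0] > self.max_staleness:
            self._cache.popleft()
        fresh = [sample for _, sample in self._cache]
        self._cache.clear()
        return fresh
```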
Methodology
SortedRL reorders rollout samples by output length, prioritizing short samples and grouping them for early updates. This yields large rollout batches and flexible update batches, combined with a cache-based mechanism that bounds the degree of off-policy training.
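One way to picture the length-aware release order is a min-heap buffer keyed on output length, so that short finished rollouts form update groups while long generations are still in flight. The sketch below assumes this reading; all names are illustrative, not the authors' code.

```python
import heapq

class SortedRolloutBuffer:
    """Releases finished rollouts shortest-first so short samples can
    form update batches early (hypothetical sketch)."""

    def __init__(self, update_batch_size: int):
        self.update_batch_size = update_batch_size
        self._heap = []    # (output_length, insertion_id, sample)
        self._next_id = 0  # tie-breaker so samples are never compared

    def add(self, sample, output_length: int):
        heapq.heappush(self._heap, (output_length, self._next_id, sample))
        self._next_id += 1

    def pop_update_batch(self):
        """Return the shortest finished samples as one update batch,
        or None until enough rollouts have completed."""
        if len(self._heap) < self.update_batch_size:
            return None
        return [heapq.heappop(self._heap)[2]
                for _ in range(self.update_batch_size)]
```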
Original Abstract
Scaling reinforcement learning (RL) has shown strong promise for enhancing the reasoning abilities of large language models (LLMs), particularly in tasks requiring long chain-of-thought generation. However, RL training efficiency is often bottlenecked by the rollout phase, which can account for up to 70% of total training time when generating long trajectories (e.g., 16k tokens), due to slow autoregressive generation and synchronization overhead between rollouts and policy updates. We propose SortedRL, an online length-aware scheduling strategy designed to address this bottleneck by improving rollout efficiency while maintaining training stability. SortedRL reorders rollout samples by output length, prioritizing short samples and forming them into groups for early updates. This simultaneously enables large rollout batches, flexible update batches, and near on-policy micro-curriculum construction. To further accelerate the pipeline, SortedRL controls the degree of off-policy training through a cache-based mechanism and is supported by dedicated RL infrastructure that manages rollouts and updates via a stateful controller and a rollout buffer. Experiments with LLaMA-3.1-8B and Qwen-2.5-32B on diverse tasks, including logical puzzles and math benchmarks such as AIME 24, Math 500, and Minerva, show that SortedRL reduces RL training bubble ratios by over 50% while attaining 3.9% to 18.4% better performance than baselines given the same amount of data.
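For illustration, the stateful controller described in the abstract could plausibly take the shape of an event loop that keeps generation workers saturated, filters finished rollouts through the staleness cache, and fires a policy update whenever the sorted buffer yields a full batch of short samples. The worker and policy interfaces below (`is_idle`, `start`, `drain_finished`, `weights`, `update`) are assumptions for the sketch, not the paper's API; `buffer` and `cache` refer to the sketches above.

```python
def controller_loop(workers, buffer, cache, policy, num_updates):
    """Hypothetical event loop: interleave rollout collection and
    policy updates without a global synchronization barrier."""
    version = 0
    while version < num_updates:
        for w in workers:
            if w.is_idle():
                w.start(policy.weights(), version)   # launch a new rollout
            for sample, length, born_version in w.drain_finished():
                cache.put((sample, length), born_version)
        # Admit only rollouts within the staleness bound, sorted by length.
        for sample, length in cache.take_fresh(version):
            buffer.add(sample, length)
        batch = buffer.pop_update_batch()
        if batch is not None:
            policy.update(batch)                     # one gradient step
            version += 1
```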