Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
AI Summary
This work proposes Composition-RL, a method that composes verifiable prompts to improve reinforcement learning (RL) training of LLMs and thereby enhance their reasoning ability.
Main Contributions
- Proposes Composition-RL, which reuses pass-rate-1 (fully solved) prompts by composing them into new training questions.
- Shows that Composition-RL consistently improves reasoning capability across different model sizes.
- Introduces a curriculum-learning variant of Composition-RL that further boosts performance.
- Demonstrates that Composition-RL enables effective cross-domain reinforcement learning.
Methodology
Multiple problems are automatically composed into a single new verifiable question, and these compositional prompts are then used for RL training. A curriculum-learning strategy can additionally be applied to gradually increase the compositional depth over the course of training.
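The composition step can be illustrated with a minimal sketch. Here we assume a simple concatenation-style template in which sub-questions are numbered, the composed answer is the ordered list of sub-answers, and verification checks each part; the names (`VerifiableProblem`, `compose`, `verify`) and the exact template are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VerifiableProblem:
    """A prompt paired with a checkable gold answer (hypothetical type)."""
    question: str
    answer: str


def compose(problems: List[VerifiableProblem]) -> VerifiableProblem:
    """Compose several verifiable problems into one compositional prompt.

    Sketch only: sub-questions are numbered and concatenated, and the
    composed gold answer is the "; "-joined list of sub-answers. The actual
    Composition-RL composition template may differ.
    """
    numbered = [f"({i + 1}) {p.question}" for i, p in enumerate(problems)]
    question = (
        "Solve all of the following problems and report each answer:\n"
        + "\n".join(numbered)
    )
    answer = "; ".join(p.answer for p in problems)
    return VerifiableProblem(question=question, answer=answer)


def verify(composed: VerifiableProblem, predicted: List[str]) -> bool:
    """Part-by-part exact-match check of a model's predicted answers."""
    gold = composed.answer.split("; ")
    return len(predicted) == len(gold) and all(
        p.strip() == g for p, g in zip(predicted, gold)
    )


# Usage: compose two solved (pass-rate-1) problems into one harder prompt.
composed = compose([
    VerifiableProblem("What is 2 + 3?", "5"),
    VerifiableProblem("What is 4 * 4?", "16"),
])
```

A curriculum variant would simply pass more sub-problems to `compose` as training progresses, increasing the compositional depth step by step.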
Original Abstract
Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at https://github.com/XinXU-USTC/Composition-RL.