Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
AI Summary
Goldilocks RL overcomes the sample inefficiency of reinforcement learning for reasoning tasks under sparse rewards by dynamically adjusting training difficulty.
Main Contributions
- Proposes the Goldilocks data sampling strategy, which dynamically selects samples of appropriate difficulty according to the student model's ability
- Uses a teacher model to predict question difficulty and guide the student model's training
- Validates the method's effectiveness on the OpenMathReasoning dataset
Methodology
A teacher-driven data sampling strategy: the teacher model predicts question difficulty, the student model is trained with GRPO, and the teacher adjusts its difficulty selection based on the student's observed performance.
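The loop above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the "just right" band `[0.2, 0.8]`, and the exponential-moving-average update are all assumptions standing in for the teacher's actual difficulty predictor.

```python
import random

def goldilocks_sample(success_est, k, low=0.2, high=0.8, rng=random):
    """Teacher step: pick k questions whose estimated student success
    probability is neither near 0 (too hard) nor near 1 (too easy)."""
    band = [q for q, p in success_est.items() if low <= p <= high]
    return rng.sample(band, min(k, len(band)))

def update_estimates(success_est, observed, alpha=0.3):
    """Adaptation step: blend the student's observed pass rates (e.g. the
    fraction of correct GRPO rollouts per question) into the teacher's
    estimates via an exponential moving average."""
    for q, pass_rate in observed.items():
        success_est[q] = (1 - alpha) * success_est[q] + alpha * pass_rate

# Toy usage: five questions with initial teacher-predicted success rates.
est = {"q1": 0.05, "q2": 0.5, "q3": 0.95, "q4": 0.3, "q5": 0.7}
batch = goldilocks_sample(est, k=2)   # only q2/q4/q5 fall in the band
update_estimates(est, {"q2": 0.8})    # student improved on q2
```

In a real training run, `observed` would come from scoring the student's GRPO rollouts on the sampled batch, so the teacher continuously tracks the student's evolving ability.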
Original Abstract
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.