Context Bootstrapped Reinforcement Learning
AI Summary
CBRL improves exploration efficiency in reinforcement learning by injecting demonstrations into training prompts, and its effectiveness is validated across a variety of reasoning tasks.
Main Contributions
- Proposes Context Bootstrapped Reinforcement Learning (CBRL)
- Guides exploration by prepending demonstrations to training prompts, improving the exploration efficiency of RLVR
- Validates the effectiveness of CBRL across multiple reasoning tasks and models
Methodology
CBRL stochastically prepends a small number of demonstrations to training prompts and gradually anneals the injection probability, guiding the model to internalize the reasoning patterns.
Original Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.