🤖 AI Summary
This work addresses the challenge of sparse rewards in reinforcement learning from verifiable rewards (RLVR), where inefficient exploration often prevents the model from generating successful trajectories, particularly in tasks requiring novel reasoning patterns or domain-specific knowledge. To overcome this, the authors propose Context Bootstrapped Reinforcement Learning (CBRL), which stochastically prepends a small number of few-shot demonstrations to the training prompt. The injection probability follows a curriculum: it starts high to guide early exploration, then anneals to zero, so the policy must internalize the reasoning patterns rather than rely on external exemplars. The approach is algorithm-agnostic and requires no test-time assistance. Experiments across two model families and five Reasoning Gym tasks show consistent improvements in both success rate and exploration efficiency, with further validation of its practical utility on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
📝 Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL's practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
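The injection mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not specify the shape of the curriculum, so the linear schedule, the default values `p_start=0.5` and `anneal_fraction=0.5`, and the function names `injection_probability` and `build_prompt` are all assumptions made for this example.

```python
import random


def injection_probability(step, total_steps, p_start=0.5, anneal_fraction=0.5):
    """Probability of prepending demonstrations at a given training step.

    Assumption: a linear decay from p_start to 0 over the first
    `anneal_fraction` of training. The source only states that the
    probability starts high and anneals to zero.
    """
    anneal_steps = max(1, int(total_steps * anneal_fraction))
    remaining = max(0.0, 1.0 - step / anneal_steps)
    return p_start * remaining


def build_prompt(task_prompt, demonstrations, p, rng, num_shots=2):
    """With probability p, prepend up to `num_shots` demonstrations.

    `demonstrations` is a list of already formatted example strings
    (hypothetical format: one worked problem and solution per string).
    """
    if demonstrations and rng.random() < p:
        shots = rng.sample(demonstrations, k=min(num_shots, len(demonstrations)))
        return "\n\n".join(shots + [task_prompt])
    # No injection: the policy sees the bare task, as it will at test time.
    return task_prompt


if __name__ == "__main__":
    rng = random.Random(0)
    demos = ["Example 1: ...", "Example 2: ...", "Example 3: ..."]
    total = 100
    for step in (0, 25, 50, 99):
        p = injection_probability(step, total)
        prompt = build_prompt("Solve the task.", demos, p, rng)
        print(step, p, prompt.count("Example"))
```

Because the probability reaches zero before training ends under this schedule, the final updates use only bare prompts, which matches the stated goal that the model must ultimately succeed without assistance. The rollouts produced from either prompt variant would then be scored by the verifier and passed to whichever RLVR algorithm is in use.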