🤖 AI Summary
This work investigates the interplay of pre-training, mid-training, and reinforcement learning (RL) post-training in shaping language models' reasoning capabilities, asking specifically whether RL genuinely extends the capability boundaries established during pre-training.
Method: We introduce a controlled experimental framework grounded in synthetic reasoning tasks, atomic operation modeling, and interpretable reasoning-chain generation. Training distributions are systematically manipulated, and process-level rewards are incorporated to guide intermediate reasoning steps.
Contributions/Results: (1) RL yields genuine reasoning improvements only when pre-training preserves sufficient headroom *and* RL targets tasks at the model's capability boundary; (2) mid-training substantially outperforms pure RL under fixed compute budgets; (3) contextual generalization transfers reliably once pre-training exposure passes a minimal threshold; (4) process-level rewards significantly enhance reasoning fidelity. Collectively, these findings provide causal evidence and principled design guidelines for reasoning-oriented language model training.
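The controlled setup described in the Method can be sketched in miniature. The snippet below is an illustrative assumption, not the paper's actual task suite: the atomic operations, task generator, and prefix-matching process reward are all hypothetical stand-ins for "synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and process-level rewards."

```python
import random

# Hypothetical atomic operations on integers (the paper's exact
# operation set is not specified in this summary).
ATOMIC_OPS = {
    "double": lambda x: 2 * x,
    "inc3": lambda x: x + 3,
    "mod7": lambda x: x % 7,
}

def make_task(depth, seed=0):
    """Sample a composition of `depth` atomic ops; return the op sequence,
    the input, a parseable ground-truth trace, and the final answer."""
    rng = random.Random(seed)
    ops = [rng.choice(list(ATOMIC_OPS)) for _ in range(depth)]
    x = rng.randint(0, 9)
    trace, cur = [], x
    for name in ops:
        nxt = ATOMIC_OPS[name](cur)
        trace.append(f"{name}({cur}) = {nxt}")  # one parseable step
        cur = nxt
    return ops, x, trace, cur

def process_reward(candidate_trace, gold_trace):
    """Fraction of prefix steps matching the gold trace: rewards correct
    intermediate reasoning, not just the final answer."""
    correct = 0
    for cand, gold in zip(candidate_trace, gold_trace):
        if cand != gold:
            break
        correct += 1
    return correct / len(gold_trace)
```

Increasing `depth` beyond the depths seen in training corresponds to the extrapolative-generalization axis; swapping the surface names of the operations while keeping their semantics corresponds to the contextual axis.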
📝 Abstract
Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence: tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL alone, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving training strategies for reasoning language models.