🤖 AI Summary
This work investigates how to design mid-training for large language models so that it enhances reasoning capabilities while preserving general performance. Through controlled experiments across model scales (3B–24B parameters), architectures, and families, the study systematically evaluates the impact of mid-training on mathematical, coding, and scientific reasoning, and explores its synergy with reinforcement learning (RL). Leveraging multi-model comparisons, CKA representation analysis, parameter trajectory tracking, and a high-quality 27B-token dataset, the authors demonstrate that mid-training reconstructs internal representations with high retention, providing an effective initialization for subsequent RL. They further show that data composition matters far more at the mid-training stage than at the RL stage, and they reveal fundamental differences in parameter update dynamics between the two stages. The proposed PRISM-to-RL pipeline yields substantial gains (+15 to +40 points in math, +5 to +12 in code, and +6 to +13 in science), elevating the macro-average score across six reasoning benchmarks from under 12 to 29–42 (a 3–4× improvement), whereas direct RL alone proves largely ineffective.
📝 Abstract
We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM-to-RL pipeline improves the macro-average score across six reasoning benchmarks from under 12 to 29–42 (a 3–4× improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2-point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies near-identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement, and they provide practical guidance for designing robust mid-training pipelines.
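For readers unfamiliar with the CKA metric used in the representation analysis, here is a minimal sketch of linear Centered Kernel Alignment in its feature-space form. The paper does not specify which CKA variant or implementation it uses, so the function name `linear_cka` and the choice of the linear (rather than RBF-kernel) variant are illustrative assumptions:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    X: (n, d1) and Y: (n, d2) hold activations for the SAME n inputs
    (e.g., one layer's hidden states before and after RL), one row per
    input, one column per feature. Returns a similarity in [0, 1].
    """
    # Center each feature (column) so CKA is invariant to mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Feature-space formula: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(numerator / denominator)
```

A CKA near 1.0 (the paper reports over 0.998 between mid-trained and RL-tuned checkpoints) indicates the two representations agree up to rotation and isotropic scaling, which is why a high score supports the claim that RL preserves mid-training's representational geometry rather than rebuilding it.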