🤖 AI Summary
How large language models (LLMs) scale during reinforcement learning (RL) post-training for mathematical reasoning remains underexplored.
Method: We conduct 54 controlled experiments to quantitatively analyze the interplay among model size, dataset volume, and computational budget.
Contribution/Results: We identify four key findings: (1) Larger models reach superior performance in fewer training steps, exhibiting higher sample efficiency; (2) Repeatedly reusing high-quality data substantially alleviates data-scarcity bottlenecks; (3) Under a fixed computational or data budget, scaling up model size consistently yields better outcomes; (4) RL learning dynamics are consistent between base and instruction-tuned models. These findings establish reproducible, empirically grounded scaling principles—enabling cost-effective, efficient enhancement of LLMs’ mathematical reasoning capabilities through principled RL post-training.
📝 Abstract
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on 54 experiments across diverse model sizes and training settings, we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1) Under a fixed computational budget, larger models trained for fewer steps consistently outperform smaller models trained for more steps. (2) Given a fixed amount of training data, larger models achieve superior sample efficiency, yielding lower loss. (3) In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. (4) These scaling behaviors are robust across both base and instruction-tuned models, which share similar learning dynamics (e.g., larger models show faster convergence) even while differing in absolute accuracy. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.