🤖 AI Summary
How large language models (LLMs) scale during reinforcement learning (RL) post-training for mathematical reasoning remains underexplored.
Method: We conduct 54 controlled experiments to quantitatively analyze the interplay among model size, dataset volume, and computational budget.
Contribution/Results: We identify four key findings: (1) Larger models reach superior performance in fewer training steps, exhibiting higher sample efficiency; (2) Repeatedly reusing high-quality data substantially alleviates data-scarcity bottlenecks; (3) Under a fixed computational or data budget, scaling up model size consistently yields better outcomes; (4) RL learning dynamics are consistent between base and instruction-tuned models. These findings establish reproducible, empirically grounded scaling principles—enabling cost-effective, efficient enhancement of LLMs’ mathematical reasoning capabilities through principled RL post-training.
📝 Abstract
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on 54 experiments across diverse model sizes and training settings, we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1) Under a fixed computational budget, larger models trained for fewer steps consistently outperform smaller models trained for more steps. (2) Given a fixed amount of training data, larger models achieve superior sample efficiency, yielding lower loss. (3) In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. (4) These scaling behaviors are robust across both base and instruction-tuned models, which share similar learning dynamics (e.g., larger models show faster convergence) even while differing in absolute accuracy. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.