🤖 AI Summary
Reinforcement learning (RL) has become central to large language model (LLM) training, yet it lacks both a predictive scaling methodology comparable to that of pre-training and a principled way to evaluate algorithmic improvements as compute grows. Method: In a systematic study consuming more than 400,000 GPU-hours, we fit sigmoidal (S-shaped) compute-performance curves to RL training runs and ablate a wide range of common design choices, including loss aggregation, normalization, curriculum, and off-policy algorithm, to measure their effects on asymptotic performance and compute efficiency. Contribution/Results: We find that recipes differ in asymptotic performance, that many design details mainly modulate compute efficiency without materially shifting the asymptote, and that stable, scalable recipes follow predictable trajectories that can be extrapolated from smaller-scale runs. Combining these findings, we propose ScaleRL, a best-practice recipe, and demonstrate it by successfully scaling a single RL run to 100,000 GPU-hours and predicting its validation performance. The work offers both a framework for analyzing RL scaling and a practical recipe that brings RL training closer to the predictability of pre-training.
📝 Abstract
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
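To make the idea of fitting a sigmoidal compute-performance curve and extrapolating from smaller-scale runs concrete, here is a minimal sketch. The abstract does not state the functional form, so the saturating curve below, its parameter names (`r0`, `asymptote`, `exponent`, `c_mid`), and the grid-search fitting procedure are all illustrative assumptions, not the authors' implementation. The data points are synthetic, generated from the curve itself rather than taken from the paper.

```python
import math

def sigmoid_curve(compute, r0, asymptote, exponent, c_mid):
    """Assumed saturating form: near r0 at low compute, near `asymptote` at high compute."""
    gate = 1.0 / (1.0 + (c_mid / compute) ** exponent)
    return r0 + (asymptote - r0) * gate

def fit_sigmoid(computes, scores, r0):
    """Grid-search exponent and c_mid; solve the asymptote in closed form for each pair."""
    best = None
    for i in range(26):                      # exponent from 0.5 to 3.0
        exponent = 0.5 + 0.1 * i
        for j in range(81):                  # c_mid from 1e1 to 1e5, log-spaced
            c_mid = 10 ** (1.0 + 0.05 * j)
            gates = [1.0 / (1.0 + (c_mid / c) ** exponent) for c in computes]
            # Least-squares estimate of (asymptote - r0) for this exponent and c_mid.
            gain = (sum(g * (y - r0) for g, y in zip(gates, scores))
                    / sum(g * g for g in gates))
            err = sum((r0 + gain * g - y) ** 2 for g, y in zip(gates, scores))
            if best is None or err < best[0]:
                best = (err, r0 + gain, exponent, c_mid)
    _, asymptote, exponent, c_mid = best
    return asymptote, exponent, c_mid

# Synthetic "early" measurements (hypothetical values, not results from the paper).
true = dict(r0=0.2, asymptote=0.6, exponent=1.5, c_mid=1000.0)
early_compute = [100, 200, 400, 800, 1600, 3200]   # GPU-hours
early_scores = [sigmoid_curve(c, **true) for c in early_compute]

# Fit on the early points only, then extrapolate to a much larger budget.
asymptote, exponent, c_mid = fit_sigmoid(early_compute, early_scores, r0=0.2)
predicted = sigmoid_curve(100_000, 0.2, asymptote, exponent, c_mid)
print(f"fitted asymptote={asymptote:.3f}, predicted score at 100k GPU-hours={predicted:.3f}")
```

In this parameterization the two quantities the abstract distinguishes map onto separate parameters: `asymptote` is the performance ceiling, while `exponent` and `c_mid` control how quickly a run approaches it, which is one way to read "compute efficiency". A design choice that changes only the latter two would shift the curve along the compute axis without raising the ceiling.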