The Art of Scaling Reinforcement Learning Compute for LLMs

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) is critical for large language model (LLM) training, yet it lacks both predictable scaling laws analogous to those of pretraining and standardized evaluation protocols. Method: Through systematic experiments consuming over 400,000 GPU-hours, we fit sigmoidal compute–performance curves for RL training, analyze how architecture, algorithmic design choices, and curriculum learning affect asymptotic performance and compute efficiency, and propose an analytical framework built on sigmoid fitting, loss normalization, off-policy optimization, and ablation studies. Contribution/Results: We establish ScaleRL, a set of best practices under which the validation performance of a single run scaled to 100,000 GPU-hours can be accurately extrapolated from smaller-scale runs. Our work significantly improves the predictability, measurability, and computational efficiency of LLM-RL training, laying foundational groundwork for principled, resource-aware RL-based LLM optimization.

📝 Abstract
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
Problem

Research questions and friction points this paper is trying to address.

Developing predictive scaling methodologies for RL training of LLMs, comparable to pre-training scaling laws
Evaluating algorithmic improvements for scaling RL compute in a principled way
Establishing a framework for predicting RL scaling trajectories from smaller-scale runs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defined a principled framework for analyzing RL scaling
Fitted sigmoidal compute–performance curves to RL training runs
Proposed ScaleRL, a best-practice recipe for predictable scaling
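The core analytical tool above, fitting a sigmoidal compute–performance curve and extrapolating it to larger compute budgets, can be sketched as follows. This is a minimal illustration, not the paper's exact parameterization: the logistic form in log-compute, the synthetic data points, and the parameter names (`asymptote`, `midpoint`, `slope`) are all assumptions for demonstration.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, asymptote, midpoint, slope):
    """Logistic curve in log-compute: performance saturates at `asymptote`."""
    return asymptote / (1.0 + np.exp(-slope * (log_c - midpoint)))

# Synthetic stand-in for (GPU-hours, validation pass-rate) measurements
compute = np.array([100, 300, 1_000, 3_000, 10_000, 30_000], dtype=float)
perf = np.array([0.18, 0.30, 0.45, 0.56, 0.62, 0.645])

# Fit the three curve parameters to the small-scale runs
params, _ = curve_fit(sigmoid, np.log(compute), perf, p0=[0.7, 7.0, 1.0])
asymptote, midpoint, slope = params

# Extrapolate: predicted validation performance at a 10x larger budget
pred = sigmoid(np.log(300_000), *params)
```

Fitting in log-compute reflects the paper's observation that stable recipes follow predictable S-shaped trajectories, so the asymptote and efficiency of a recipe can be estimated long before the full compute budget is spent.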