🤖 AI Summary
This work challenges the prevailing assumption that multi-stage training, dynamic hyperparameter scheduling, and curriculum learning are necessary in reinforcement learning (RL) for large language models (LLMs). We propose a minimalist RL recipe: single-stage training on two 1.5B reasoning models with all hyperparameters held fixed, omitting length penalties, external reward verifiers, and any mid-run human intervention. Our key insight is that "stable scaling" (consistent, well-conditioned optimization dynamics at scale) suffices to prevent training collapse and plateaus, enabling smooth, monotonic improvement throughout training. Evaluated on nine mathematical reasoning benchmarks, the two models achieve average accuracies of 54.9% and 64.3%, matching state-of-the-art performance while halving computational cost. These results demonstrate that substantially simplifying the RL pipeline is not only feasible but also improves both training efficiency and stability.
📝 Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: **Is this complexity necessary?** We present **JustRL**, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
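To make the "single-stage training with fixed hyperparameters" idea concrete, here is a minimal sketch of a policy-gradient loop in which every hyperparameter stays constant from the first step to the last: no stage switches, no learning-rate schedule, no curriculum. This is not the JustRL training code; a toy two-armed bandit stands in for the LLM environment, and all names and values are illustrative.

```python
# Illustrative sketch only: a single-stage REINFORCE loop where every
# hyperparameter (learning rate, baseline smoothing) is fixed for the
# whole run. A two-armed bandit replaces the LLM reasoning task.
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train_bandit(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> float:
    """Return the final probability of choosing the better arm (arm 1)."""
    rng = random.Random(seed)
    theta = 0.0      # single policy parameter: logit of P(arm 1)
    baseline = 0.0   # running reward baseline (fixed smoothing, like the LR)
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.2   # arm 1 pays more
        baseline += 0.1 * (reward - baseline)
        # REINFORCE update for a Bernoulli policy: grad log pi = action - p
        theta += lr * (reward - baseline) * (action - p)
    return sigmoid(theta)

if __name__ == "__main__":
    print(f"P(better arm) after training: {train_bandit():.3f}")
```

Even this toy version shows the point the paper makes at scale: with a fixed learning rate and no intervention, the policy improves smoothly toward the better arm rather than collapsing or plateauing.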