🤖 AI Summary
This work challenges the prevailing assumption that multi-stage training, dynamic hyperparameter scheduling, and curriculum learning are necessary in reinforcement learning (RL) for large language models (LLMs). We propose a minimalist RL recipe: single-stage training on two 1.5B reasoning models with all hyperparameters held fixed, omitting length penalties, external reward verifiers, and any mid-run human intervention. Our key insight is that "stable scaling" (consistent, well-conditioned optimization dynamics at scale) suffices to prevent training collapse and plateaus, enabling smooth, monotonic improvement throughout training. Evaluated on nine mathematical reasoning benchmarks, the two models achieve average accuracies of 54.9% and 64.3%, matching state-of-the-art performance while halving computational cost. These results demonstrate that substantially simplifying the RL pipeline is not only feasible but also improves both training efficiency and stability.
📝 Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: **Is this complexity necessary?** We present **JustRL**, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
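To make the "single-stage training with fixed hyperparameters" idea concrete, here is a minimal sketch of a policy-gradient loop in which every hyperparameter stays constant from the first step to the last: no stage switches, no learning-rate schedule, no curriculum. This is not the JustRL training code; a toy two-armed bandit stands in for the LLM environment, and all names and values are illustrative.

```python
# Illustrative sketch only: a single-stage REINFORCE loop where every
# hyperparameter (learning rate, baseline smoothing) is fixed for the
# whole run. A two-armed bandit replaces the LLM reasoning task.
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def train_bandit(steps: int = 2000, lr: float = 0.1, seed: int = 0) -> float:
    """Return the final probability of choosing the better arm (arm 1)."""
    rng = random.Random(seed)
    theta = 0.0      # single policy parameter: logit of P(arm 1)
    baseline = 0.0   # running reward baseline (fixed smoothing, like the LR)
    for _ in range(steps):
        p = sigmoid(theta)
        action = 1 if rng.random() < p else 0
        reward = 1.0 if action == 1 else 0.2   # arm 1 pays more
        baseline += 0.1 * (reward - baseline)
        # REINFORCE update for a Bernoulli policy: grad log pi = action - p
        theta += lr * (reward - baseline) * (action - p)
    return sigmoid(theta)

if __name__ == "__main__":
    print(f"P(better arm) after training: {train_bandit():.3f}")
```

Even this toy version shows the point the paper makes at scale: with a fixed learning rate and no intervention, the policy improves smoothly toward the better arm rather than collapsing or plateauing.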