JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work challenges the prevailing assumption that multi-stage training, dynamic hyperparameter scheduling, and curriculum learning are necessary in reinforcement learning (RL) for large language models (LLMs). The authors propose a minimalist recipe: single-stage RL training on two 1.5B reasoning models with all hyperparameters held fixed, omitting length penalties, elaborate reward verifiers, and mid-run intervention. The key insight is that a stable, scaled-up baseline—consistent, well-conditioned optimization dynamics—suffices to avoid the training collapses and plateaus that usually motivate added complexity, yielding smooth, monotonic improvement over 4,000+ steps. Evaluated on nine mathematical reasoning benchmarks, the approach reaches 54.9% and 64.3% average accuracy on the two models, matching state-of-the-art performance at half the compute. These results suggest that substantial simplification of the RL pipeline is not only feasible but also improves both training efficiency and stability.

📝 Abstract
Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
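
Because the recipe deliberately drops length penalties and robust verifier cascades, the reward signal reduces to simple binary answer checking. The snippet below is a minimal illustrative sketch of such a verifier, not the authors' released code; the normalization rules are assumptions made for the example.

```python
# Minimal sketch of a binary math-answer verifier: exact match after light
# normalization, with no length penalty and no fallback verifier cascade.
# Illustrative only; the paper's released code defines the actual reward.
def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 iff the model's final answer matches the reference."""
    def normalize(s: str) -> str:
        # Assumed normalization: trim whitespace, drop a trailing period,
        # remove internal spaces, lowercase.
        return s.strip().rstrip(".").replace(" ", "").lower()
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

# Usage: a correct answer scores 1.0 regardless of response length.
assert math_reward(" 42. ", "42") == 1.0
assert math_reward("41", "42") == 0.0
```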
Problem

Research questions and friction points this paper is trying to address.

Is the growing complexity of RL for LLMs (multi-stage pipelines, dynamic hyperparameter schedules, curricula) actually necessary?
Can a minimal, single-stage recipe match state-of-the-art performance on reasoning benchmarks?
Do the "standard tricks" that motivate this complexity solve real problems, or problems created by unstable baselines?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage training with one fixed set of hyperparameters that transfers across both models without tuning (a minimal sketch follows this list)
Matches state-of-the-art accuracy on nine mathematical benchmarks with 2× less compute
Ablations show that "standard tricks" such as explicit length penalties and robust verifiers can degrade performance by collapsing exploration
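
For the flavor of such a loop, here is a hedged sketch of a group-relative (GRPO-style) advantage computation over binary verifier rewards with one fixed configuration. The abstract does not name the exact algorithm or settings, so the hyperparameter values and the `CONFIG`/`group_relative_advantages` names below are illustrative assumptions, not the paper's.

```python
import torch

# Fixed hyperparameters for the whole run: no schedules, no stages.
# Illustrative values only; the paper's exact settings ship with its code.
CONFIG = {
    "learning_rate": 1e-6,  # held constant over 4,000+ steps
    "group_size": 8,        # responses sampled per prompt
}

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """Normalize binary verifier rewards within each prompt's sample group.

    rewards: (num_prompts, group_size) tensor of 0/1 correctness scores.
    No length penalty or other shaping terms are mixed in.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 8 sampled responses each.
rewards = torch.tensor([
    [1., 0., 1., 1., 0., 0., 1., 0.],
    [0., 0., 0., 1., 0., 0., 0., 0.],
])
print(group_relative_advantages(rewards))
```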
👥 Authors

Bingxiang He — Second-year PhD candidate, Tsinghua University (Natural Language Processing)
Zekai Qu — Tsinghua University
Zeyuan Liu — Tsinghua University
Yinghao Chen — Tsinghua University
Yuxin Zuo — Tsinghua University
Cheng Qian — University of Illinois Urbana-Champaign
Kaiyan Zhang — Tsinghua University (Foundation Model, Collective Intelligence, Scientific Intelligence)
Weize Chen — Tsinghua University (NLP, ML)
Chaojun Xiao — Postdoctoral Researcher, Tsinghua University (Large Language Model)
Ganqu Cui — Shanghai AI Lab (LLM Alignment, Reinforcement Learning)
Ning Ding — Tsinghua University
Zhiyuan Liu — Tsinghua University