🤖 AI Summary
This work addresses the limited scalability, procedural complexity, and poor reproducibility of large-scale reinforcement learning (RL) for enhancing large language model (LLM) reasoning. We propose the first open-source, fully reproducible pure-RL reasoning training paradigm: it eliminates KL-divergence regularization and instead adopts a minimalist PPO implementation with generalized advantage estimation (GAE) using γ = λ = 1, coupled with sparse rule-based rewards, enabling end-to-end optimization without supervised fine-tuning. Our method matches or exceeds DeepSeek-R1-Zero-Qwen-32B on AIME2024, MATH500, and GPQA Diamond while requiring only 10% of the training steps. We publicly release all code, datasets, and model weights across multiple scales, significantly improving the simplicity, scalability, and accessibility of RL-based reasoning training.
📝 Abstract
We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training, focused on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, mirroring the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps of the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
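To illustrate the minimalist advantage estimator described above, here is a hedged sketch of standard GAE (the function name and toy numbers are illustrative, not from the released code). With $\lambda=1$ and $\gamma=1$, the recursion collapses to the full Monte Carlo return-to-go minus the value baseline, which pairs naturally with a sparse rule-based reward granted only at the final token:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over a single trajectory.

    With gamma = lam = 1 the estimate reduces to
    A_t = sum_{k>=t} r_k - V(s_t), i.e. return-to-go minus the baseline.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        # Bootstrap from the next state's value; 0 past the end of the episode.
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

# Sparse rule-based reward: 0 at every token, +1 at the final token if the
# answer passes the rule-based check (values here are placeholder estimates).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]
adv = gae_advantages(rewards, values)  # gamma = lam = 1
# Every token's advantage equals return-to-go (1.0) minus its value (0.5).
```

Because no discounting or exponential weighting is applied, every token in a correct response receives the same credit, which is one reading of why this setting scales so simply.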