🤖 AI Summary
This work addresses the limited scalability, procedural complexity, and poor reproducibility of large-scale reinforcement learning (RL) for enhancing large language model (LLM) reasoning. We propose the first open-source, fully reproducible pure-RL reasoning training paradigm: it eliminates KL-divergence regularization and instead adopts a minimalist PPO implementation with generalized advantage estimation (GAE) using γ = λ = 1, coupled with sparse rule-based rewards, enabling end-to-end optimization without supervised fine-tuning. Our method matches or exceeds DeepSeek-R1-Zero-Qwen-32B on AIME2024, MATH500, and GPQA Diamond while requiring only 10% of the training steps. We publicly release all code, datasets, and model weights across multiple scales, significantly improving the simplicity, scalability, and accessibility of RL-based reasoning training.
📝 Abstract
We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training, focused on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, mirroring the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps of the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
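To illustrate the minimalist advantage estimator described above, here is a hedged sketch of standard GAE (the function name and toy numbers are illustrative, not from the released code). With $\lambda=1$ and $\gamma=1$, the recursion collapses to the full Monte Carlo return-to-go minus the value baseline, which pairs naturally with a sparse rule-based reward granted only at the final token:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over a single trajectory.

    With gamma = lam = 1 the estimate reduces to
    A_t = sum_{k>=t} r_k - V(s_t), i.e. return-to-go minus the baseline.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    last_adv = 0.0
    for t in reversed(range(T)):
        # Bootstrap from the next state's value; 0 past the end of the episode.
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

# Sparse rule-based reward: 0 at every token, +1 at the final token if the
# answer passes the rule-based check (values here are placeholder estimates).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.5]
adv = gae_advantages(rewards, values)  # gamma = lam = 1
# Every token's advantage equals return-to-go (1.0) minus its value (0.5).
```

Because no discounting or exponential weighting is applied, every token in a correct response receives the same credit, which is one reading of why this setting scales so simply.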