Understanding R1-Zero-Like Training: A Critical Perspective

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two critical issues in the R1-Zero reinforcement learning (RL) training paradigm: (1) the impact of foundation model pretraining characteristics on RL performance, and (2) an optimization bias in Group Relative Policy Optimization (GRPO) that induces a preference for longer responses. Through systematic ablation studies and behavioral analysis, the authors find that mainstream base models (e.g., DeepSeek-V3-Base, Qwen2.5) already possess implicit reasoning capabilities ("aha moments") and pretraining-induced reasoning biases. To mitigate artificial response-length inflation, they propose Dr. GRPO, a bias-free optimization algorithm that removes the length- and variance-based normalization terms responsible for the bias. Building on this, they design a minimal, highly efficient R1-Zero training framework. Evaluated on AIME 2024, their 7B model achieves 43.3% accuracy, setting a new state of the art, while significantly improving token efficiency and training stability.

📝 Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an ''Aha moment'', while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
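The GRPO bias described above comes from two normalization terms in the advantage computation: dividing the group-centered reward by the group's reward std, and averaging each response's loss over its own token count. A minimal sketch of the contrast with Dr. GRPO, which drops both terms, is below (function names and the toy 0/1 rewards are illustrative assumptions, not taken from the paper's code):

```python
from statistics import pstdev

def grpo_terms(rewards, lengths):
    # GRPO-style advantage: subtract the group mean reward, then divide
    # by the group std; the per-response loss is additionally scaled by
    # 1/|o_i| (response length), which, per the paper, under-penalizes
    # long incorrect responses and so inflates response length.
    mean = sum(rewards) / len(rewards)
    std = pstdev(rewards)
    advantages = [(r - mean) / (std + 1e-8) for r in rewards]
    token_weights = [1.0 / length for length in lengths]
    return advantages, token_weights

def dr_grpo_terms(rewards, lengths):
    # Dr. GRPO: keep the group-mean baseline, but drop both the std
    # division and the length normalization, so every token in every
    # sampled response carries equal weight in the gradient.
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    token_weights = [1.0 for _ in lengths]
    return advantages, token_weights
```

With binary correctness rewards `[1, 0, 0, 1]` over a sampled group, Dr. GRPO yields symmetric advantages `[0.5, -0.5, -0.5, 0.5]` regardless of response lengths, while GRPO rescales them by the group std and down-weights tokens of longer responses.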
Problem

Research questions and friction points this paper is trying to address.

Analyze R1-Zero-like training's core components: base models and RL.
Identify optimization bias in GRPO affecting response length.
Propose Dr. GRPO to improve token efficiency and reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning at scale directly enhances LLM reasoning without supervised fine-tuning
Dr. GRPO removes GRPO's length and variance normalization biases, improving token efficiency
A minimalist R1-Zero recipe achieves state-of-the-art AIME 2024 accuracy with a 7B model