SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

📅 2025-03-24
🏛️ arXiv.org
📈 Citations: 25
Influential: 4
🤖 AI Summary
This work investigates the universality and training dynamics of zero reinforcement learning (zero RL) training, i.e., rule-based RL applied directly to base models without prior supervised fine-tuning, across heterogeneous foundation models. Method: We systematically evaluate whether long chain-of-thought (CoT) reasoning emerges directly from base models, without explicit CoT supervision, across ten open-source models spanning diverse families and scales. We introduce two key design strategies, format reward shaping and query difficulty control, and pair rule-based RL with joint monitoring of response length and verification behavior in a cross-model training-dynamics analysis. Contribution/Results: We observe, for the first time, the "aha moment" (emergent verification behavior) in small models outside the Qwen family, and find that increased response length does not always coincide with the emergence of such cognitive behaviors. Experiments demonstrate substantial improvements in reasoning accuracy and response length across most models. To foster reproducibility, we open-source all code, trained models, and analysis tools.

📝 Abstract
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative, as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes, including Llama-3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
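The rule-based reward described above can be sketched in a few lines. The following is a minimal illustration only, not the paper's exact scheme: the function name, the `\boxed{...}` answer format, and the reward/penalty values are assumptions chosen to show the idea of combining a format reward (penalize responses with no extractable final answer) with a correctness reward (exact match against the gold answer).

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Illustrative rule-based reward for zero RL training (values are assumptions).

    - Format reward: the response must present its final answer as \\boxed{...};
      otherwise a penalty is returned, steering the model toward parseable output.
    - Correctness reward: the boxed answer is compared to the gold answer by
      exact string match (real verifiers typically normalize math expressions).
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return -0.5  # format penalty: no boxed final answer found
    answer = match.group(1).strip()
    return 1.0 if answer == gold_answer.strip() else 0.0
```

Adjusting the format reward, one of the design strategies the paper highlights, then amounts to tuning how strictly (and how harshly) malformed responses are penalized, which matters most for base models that do not yet follow instructions reliably.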
Problem

Research questions and friction points this paper is trying to address.

Investigates zero RL training across diverse base models
Explores emergence of reasoning in non-Qwen small models
Improves reasoning accuracy via reward and difficulty control
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero RL training from diverse base models
Adjusting format reward and query difficulty
Monitoring distinct training dynamics patterns