OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

📅 2025-06-25
🤖 AI Summary
This study investigates the root causes of performance disparities between base language model families (e.g., Llama and Qwen) during reinforcement learning (RL) post-training, centering on the question: *"Which mid-training strategies enhance a model's RL scalability?"* The authors propose a two-phase Stable-then-Decay mid-training paradigm and systematically show that high-quality mathematical corpora and chain-of-thought (CoT) reasoning data are critical to RL stability and downstream performance, while data formatting significantly affects training robustness. The methodology combines large-scale mid-training, multiple CoT-focused training branches, and scheduled learning-rate decay. Key contributions include: (1) the OctoThinker model series, which substantially improves the RL compatibility of Llama-family models and narrows their mathematical-reasoning and RL performance gap with Qwen; and (2) the open-sourcing of MegaMath-Web-Pro-Max, a 70B-token high-quality mathematical reasoning corpus, enabling reproducible, scalable RL research.

📝 Abstract
Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbose model responses and instability in RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
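The Stable-then-Decay schedule described above can be sketched as a learning-rate function of tokens seen: a constant rate through the 200B-token stable phase, then decay over the 20B-token branch phase. The base rate, final-rate fraction, and cosine decay shape below are illustrative assumptions, not values reported by the paper.

```python
import math

def stable_then_decay_lr(tokens_seen: float,
                         base_lr: float = 3e-4,       # assumed peak LR, not from the paper
                         stable_tokens: float = 200e9,  # stable phase length (from the abstract)
                         decay_tokens: float = 20e9,    # decay phase length (from the abstract)
                         final_lr_frac: float = 0.1) -> float:  # assumed floor ratio
    """Sketch of a Stable-then-Decay schedule: constant LR during the
    stable phase, then cosine decay to a floor during the decay phase."""
    if tokens_seen <= stable_tokens:
        return base_lr  # stable phase: constant learning rate
    # decay phase: cosine anneal from base_lr down to base_lr * final_lr_frac
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    min_lr = base_lr * final_lr_frac
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule returns the full `base_lr` anywhere in the first 200B tokens and reaches the floor of `base_lr * final_lr_frac` at 220B tokens; each of the three CoT-focused branches would run its own copy of the decay phase.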
Problem

Research questions and friction points this paper is trying to address.

Investigates mid-training strategies for RL-scalable language models
Examines impact of math corpora and QA data on RL performance
Proposes Stable-then-Decay strategy to enhance RL compatibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-quality math corpora boost RL performance
QA-style data enhances RL training outcomes
Two-stage mid-training strategy improves RL compatibility
Zengzhi Wang
Shanghai Jiao Tong University
Data Engineering · Complex Reasoning · Large Language Models · Natural Language Processing
Fan Zhou
Shanghai Jiao Tong University, SII, GAIR Lab
Xuefeng Li
Shanghai Jiao Tong University, SII, GAIR Lab
Pengfei Liu
Shanghai Jiao Tong University, SII, GAIR Lab