MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

📅 2025-05-12
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited reasoning capabilities of large language models (LLMs) on mathematical, programming, and general reasoning tasks by proposing MiMo, a reasoning-optimized 7B model. During pretraining, the authors enhance the data preprocessing pipeline, apply a three-stage data mixing strategy, and add a multi-token prediction (MTP) objective to accelerate the acquisition of reasoning patterns. In post-training, they build a reinforcement learning framework on verifiable mathematical and programming problems, combining strategic data resampling with test-difficulty-driven code reward shaping to mitigate sparse rewards and stabilize training. The approach optimizes pretraining and post-training jointly for reasoning. Experiments show that MiMo-7B-Base surpasses 32B baselines across multiple reasoning benchmarks, and MiMo-7B-RL outperforms OpenAI o1-mini on mathematical reasoning, code generation, and general reasoning tasks.
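The summary names test-difficulty-driven code reward shaping but not its form. Below is a minimal sketch of one plausible reading, in which each test case is weighted by how rarely sampled solutions pass it, so that partial credit is dense rather than all-or-nothing; the inverse-pass-rate weighting and all function names are assumptions for illustration, not the paper's exact scheme.

```python
from typing import List

def test_difficulty_weights(pass_rates: List[float], eps: float = 1e-6) -> List[float]:
    # Hypothetical weighting: tests that few sampled solutions pass
    # (low pass rate) are treated as harder and earn a larger share
    # of the total reward.
    raw = [1.0 / (r + eps) for r in pass_rates]
    total = sum(raw)
    return [w / total for w in raw]

def dense_code_reward(passed: List[bool], pass_rates: List[float]) -> float:
    # Sum the weights of the tests this rollout passed, yielding a
    # dense reward in [0, 1] instead of a sparse all-or-nothing one.
    weights = test_difficulty_weights(pass_rates)
    return sum(w for w, ok in zip(weights, passed) if ok)

# Example: four test cases; the last is rarely passed, so solving it
# moves the reward far more than solving the easy ones.
rates = [0.9, 0.8, 0.5, 0.05]
print(round(dense_code_reward([True, True, False, False], rates), 3))
print(round(dense_code_reward([True, True, True, True], rates), 3))
```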

📝 Abstract
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
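The abstract credits a Multi-Token Prediction objective for both performance and inference speed but gives no formula. Below is a minimal PyTorch sketch of the common formulation, where auxiliary heads predict tokens two or more steps ahead and their cross-entropy losses form an extra training signal; head structure, offsets, and loss weighting are assumptions, not MiMo's published architecture.

```python
import torch
import torch.nn as nn

class MTPHeads(nn.Module):
    # Illustrative multi-token prediction: one extra linear head per
    # future offset, each trained with cross-entropy against the token
    # that many steps ahead; head depth and loss weighting are assumed.
    def __init__(self, hidden: int, vocab: int, n_future: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, vocab) for _ in range(n_future)]
        )

    def forward(self, hidden_states: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden) from the transformer trunk;
        # tokens: (batch, seq) input token ids.
        loss = hidden_states.new_zeros(())
        for k, head in enumerate(self.heads, start=2):  # offsets 2, 3, ...
            logits = head(hidden_states[:, :-k, :])  # predict token t+k from position t
            target = tokens[:, k:]
            loss = loss + nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return loss / len(self.heads)  # auxiliary loss added to the next-token loss

# Toy usage with random activations standing in for the trunk.
B, T, H, V = 2, 16, 64, 1000
mtp = MTPHeads(H, V)
aux = mtp(torch.randn(B, T, H), torch.randint(0, V, (B, T)))
print(aux.item())
```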
Problem

Research questions and friction points this paper is trying to address.

Enhancing the reasoning potential of language models through optimized pre-training and post-training stages (a toy data-mixing sketch follows this list)
Addressing sparse-reward issues in reinforcement learning with a test-difficulty-driven code-reward scheme
Matching or exceeding much larger models on mathematics, code, and general reasoning tasks
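The three-stage mixing referenced in the first point above is named but not specified. Below is a toy sketch of stage-dependent mixture weights, where later stages up-weight code, math, and synthetic reasoning data; the source names and proportions are invented for illustration.

```python
import random

# Invented per-stage source proportions; the paper specifies three
# stages but not the mixture at this level of detail.
STAGE_MIX = {
    1: {"web": 0.70, "code": 0.15, "math": 0.05, "synthetic_reasoning": 0.10},
    2: {"web": 0.50, "code": 0.20, "math": 0.15, "synthetic_reasoning": 0.15},
    3: {"web": 0.30, "code": 0.25, "math": 0.20, "synthetic_reasoning": 0.25},
}

def sample_source(stage: int) -> str:
    # Draw one data source according to the current stage's mixture.
    sources, weights = zip(*STAGE_MIX[stage].items())
    return random.choices(sources, weights=weights, k=1)[0]

print([sample_source(stage=3) for _ in range(5)])
```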
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage data mixing strategy enhances reasoning
Multi-Token Prediction objective boosts performance
Test-difficulty-driven code-reward scheme alleviates sparse rewards; strategic data resampling stabilizes RL training (see the sketch below)
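The resampling in the last point is described only as "strategic". Below is a minimal sketch under the assumption that "easy" means a high rollout solve rate from the previous epoch, so that saturated problems, which contribute little gradient signal, are mostly dropped from the pool; thresholds and names are hypothetical.

```python
import random

def resample_training_pool(problems, solve_rates,
                           easy_threshold=0.9, keep_easy_fraction=0.1):
    # Hypothetical strategic resampling: problems the current policy
    # already solves almost every time yield little learning signal,
    # so most are dropped and training concentrates on harder ones.
    easy = [p for p, r in zip(problems, solve_rates) if r >= easy_threshold]
    hard = [p for p, r in zip(problems, solve_rates) if r < easy_threshold]
    n_keep = int(keep_easy_fraction * len(easy))
    return hard + random.sample(easy, n_keep)

# Toy pool: problem ids with per-problem solve rates from the last epoch.
problems = list(range(10))
solve_rates = [0.0, 0.2, 1.0, 0.95, 0.5, 1.0, 0.1, 0.98, 0.3, 0.6]
print(sorted(resample_training_pool(problems, solve_rates)))
```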
👥 Authors

Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue