MiMo-V2-Flash Technical Report

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
This work proposes a 309B-parameter sparse mixture-of-experts (MoE) language model with only 15B parameters activated per token, designed to enhance reasoning speed, capability, and agent-task performance while reducing computational cost. The architecture interleaves sliding-window and global attention and introduces a multi-token prediction (MTP) framework alongside a multi-teacher on-policy distillation (MOPD) approach to enable efficient training and speculative decoding. Despite using only one-half to one-third of the total parameters of leading open-weight models, the proposed model achieves comparable or superior performance, accelerates decoding by up to 2.6× via speculative decoding (average acceptance length up to 3.6 tokens), and supports context lengths of up to 256k tokens.
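The hybrid attention scheme (five sliding-window layers per global layer, with a 128-token window) can be illustrated with a minimal mask sketch. The 5:1 ratio and window size come from the paper; the mask construction and the `layer_schedule` helper are generic illustrations, not the model's actual implementation.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask; if `window` is set, each query may only attend
    to the most recent `window` keys (sliding-window attention)."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    if window is None:
        return causal                 # global attention
    return causal & (j > i - window)  # sliding-window attention

def layer_schedule(n_layers, ratio=5, window=128):
    """Interleave `ratio` SWA layers per global layer (hypothetical layout)."""
    return ["global" if (k + 1) % (ratio + 1) == 0 else f"swa({window})"
            for k in range(n_layers)]

# A 128-token window keeps per-layer attention cost linear in sequence
# length, while the periodic global layer preserves long-range information.
print(layer_schedule(6))  # five SWA layers followed by one global layer
```

The design intuition: SWA layers bound the KV-cache and attention cost per layer, and the occasional global layer lets information propagate across the full context.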

📝 Abstract
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window at a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length that is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm, in which domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to fully absorb each teacher's expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2 despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves an average acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
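The reported decoding numbers can be sanity-checked with standard speculative-decoding accounting: each verification pass of the target model emits, on average, the acceptance length in tokens rather than a single token, at the extra cost of the draft steps. The draft-cost ratio below is an assumed value for illustration, not a figure from the paper.

```python
def spec_decode_speedup(acceptance_len, n_draft_tokens, draft_cost_ratio):
    """Idealized decoding speedup over plain autoregressive decoding.

    acceptance_len:   mean tokens emitted per draft-and-verify cycle
    n_draft_tokens:   tokens proposed per cycle (three MTP layers -> 3)
    draft_cost_ratio: cost of one draft step relative to one full
                      target-model forward pass (assumed value)
    """
    cycle_cost = 1.0 + n_draft_tokens * draft_cost_ratio  # 1 verify + drafts
    return acceptance_len / cycle_cost

# With the paper's 3.6 acceptance length, three MTP draft steps, and an
# assumed per-draft cost of 10% of a target forward pass:
print(round(spec_decode_speedup(3.6, 3, 0.10), 2))  # in the ballpark of 2.6x
```

Because the MTP layers are tiny relative to the 309B target model, the draft overhead is small and the speedup tracks the acceptance length closely.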
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
fast reasoning
agentic capabilities
large language model
efficient inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Hybrid Attention
Multi-Token Prediction
On-Policy Distillation
Speculative Decoding
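The On-Policy Distillation idea above can be sketched as follows: the student samples a response, and a domain-matched teacher scores every sampled token, yielding a dense per-token reward (here, the teacher-student log-probability gap, a Monte-Carlo estimate of the negative reverse KL). This is a generic sketch of on-policy distillation under those assumptions, not the paper's exact MOPD objective; `pick_teacher` and its routing rule are hypothetical.

```python
def per_token_distill_rewards(student_logprobs, teacher_logprobs):
    """Dense reward for each token the student actually sampled:
    teacher log-prob minus student log-prob. Summed over a response,
    this estimates the negative reverse KL(student || teacher)."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

def pick_teacher(domain, teachers):
    """Route a prompt to its domain-specialized teacher (e.g. a large-scale
    RL-trained math teacher for math prompts); fall back to a general one."""
    return teachers.get(domain, teachers["general"])

# Tokens where the teacher is more confident than the student get a
# positive reward; tokens the teacher disfavors are pushed down.
print(per_token_distill_rewards([-1.0, -2.0], [-0.5, -2.5]))
```

Compared with sparse outcome rewards in RL, such token-level feedback gives the student a learning signal at every position of its own rollouts.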
Authors

Bangjun Xiao (LLM-Core Xiaomi)
Bing Xia (LLM-Core Xiaomi)
Bo Yang (LLM-Core Xiaomi)
Bofei Gao (Peking University): Natural Language Processing
Bowen Shen (LLM-Core Xiaomi)
Chen Zhang (LLM-Core Xiaomi)
Chenhong He (LLM-Core Xiaomi)
Chiheng Lou (Peking University)
Fuli Luo (LLM-Core Xiaomi)
Gang Wang (LLM-Core Xiaomi)
Gang Xie (Academy of Mathematics and Systems Science, Chinese Academy of Sciences (CAS)): Economic and financial forecasting; Logistics and supply chain management
Hailin Zhang (LLM-Core Xiaomi)
Hanglong Lv (LLM-Core Xiaomi)
Hanyu Li (LLM-Core Xiaomi)
Heyu Chen (LLM-Core Xiaomi)
Hongshen Xu (Shanghai Jiao Tong University): Natural Language Processing; Large Language Model; LLM Alignment
Houbin Zhang (LLM-Core Xiaomi)
Huaqiu Liu (LLM-Core Xiaomi)
Jiangshan Duo (LLM-Core Xiaomi)
Jianyu Wei (USTC & MSRA Joint PhD): LLM Infra; Inference System; Quantization; Kernel; Co-design
Jiebao Xiao (LLM-Core Xiaomi)
Jinhao Dong (Peking University): SE Augments AI; Trustworthy Software Development; Pre-training; Code Generation
Jun Shi (LLM-Core Xiaomi)
Junhao Hu (LLM-Core Xiaomi)
Kainan Bao (LLM-Core Xiaomi)
Kang Zhou (LLM-Core Xiaomi)
Lei Li (LLM-Core Xiaomi)
Liang Zhao (StepFun): MLLM; LLM
Linghao Zhang (LLM-Core Xiaomi)
Peidian Li (LLM-Core Xiaomi)
Qianli Chen (LLM-Core Xiaomi)
Shaohui Liu (LLM-Core Xiaomi)
Shihua Yu (LLM-Core Xiaomi)
Shijie Cao (Microsoft Research Asia): Efficient Deep Learning; Deep Learning System; Computer Architecture
Shimao Chen (LLM-Core Xiaomi)
Shouqiu Yu (LLM-Core Xiaomi)
Shuo Liu (LLM-Core Xiaomi)
Tianling Zhou (LLM-Core Xiaomi)
Weijiang Su (LLM-Core Xiaomi)
Weikun Wang (Microsoft): Statistical Analysis; Modeling; NLP; Deep Learning
Wenhan Ma (LLM-Core Xiaomi)
Xiangwei Deng (LLM-Core Xiaomi)
Bo Mao (LLM-Core Xiaomi)
Bowen Ye (LLM-Core Xiaomi)
Can Cai (LLM-Core Xiaomi)
Chenghua Wang (LLM-Core Xiaomi)
Chengxuan Zhu (LLM-Core Xiaomi)
Chong Ma (Southwest Jiaotong University): Deep Learning; Human Computer Interaction; Medical Image Analysis
Chun Chen (LLM-Core Xiaomi)
Chunan Li (LLM-Core Xiaomi)
Dawei Zhu (LLM-Core Xiaomi)
Deshan Xiao (LLM-Core Xiaomi)
Dong Zhang (LLM-Core Xiaomi)
Duo Zhang (Twitter, Inc.): Text Mining; Information Retrieval; Data Mining; Machine Learning; Social Networks
Fangyue Liu (LLM-Core Xiaomi)
Feiyu Yang (LLM-Core Xiaomi)
Fengyuan Shi (LLM-Core Xiaomi)
Guoan Wang (Stevens Institute of Technology): General Medical AI
Hao Tian (LLM-Core Xiaomi)
Hao Wu (LLM-Core Xiaomi)
Hengxu Qu (LLM-Core Xiaomi)
Hongfei Yi (LLM-Core Xiaomi)
Hongxu An (LLM-Core Xiaomi)
Xing Zhang (LLM-Core Xiaomi)
Yifan Song (LLM-Core Xiaomi)
Yihan Yan (LLM-Core Xiaomi)
Yihao Zhao (Peking University): Artificial Intelligence; Deep Learning; AI system
Yingchun Lai (LLM-Core Xiaomi)
Yizhao Gao (LLM-Core Xiaomi)
Yu Cheng (LLM-Core Xiaomi)
Yuanyuan Tian (Microsoft Gray Systems Lab (GSL)): Big Data; SQL-on-Hadoop; HTAP; Graph Analytics; Databases
Yudong Wang (LLM-Core Xiaomi)
Zhen Tang (Institute of Software, Chinese Academy of Sciences): Cloud Computing; Virtualization; Storage
Zhengju Tang (Peking University)
Zhengtao Wen (LLM-Core Xiaomi)
Zhichao Song (LLM-Core Xiaomi)
Zhixian Zheng (LLM-Core Xiaomi)
Zihan Jiang (Huawei): AI Benchmarking; Distributed Deep Learning; Workload Characterization
Jian Wen (Xiaomi EV Company Limited): Autonomous driving; motion and path planning
Jiarui Sun (LLM-Core Xiaomi)
Jiawei Li (LLM-Core Xiaomi)
Jinlong Xue (Beijing University of Posts and Telecommunications): Speech Synthesis; Speech Processing
Jun Xia (LLM-Core Xiaomi)
Kai Fang (LLM-Core Xiaomi)
Menghang Zhu (LLM-Core Xiaomi)
Nuo Chen (LLM-Core Xiaomi)
Qian Tu (LLM-Core Xiaomi)
Qihao Zhang (LLM-Core Xiaomi)
Qiying Wang (The University of Sydney): Nonstationary time series econometrics; Financial econometrics; Nonparametric statistics; Econometric Theory; Self-normalized li
Rang Li (LLM-Core Xiaomi)
Rui Ma (LLM-Core Xiaomi)
Shaolei Zhang (Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)): Natural Language Processing; Large Language Model; Multimodal LLMs; Simultaneous Translation
Shengfan Wang (LLM-Core Xiaomi)
Shicheng Li (LLM-Core Xiaomi)
Shuhao Gu (Xiaomi): LLM; Vision-Language Model; AGI
Shuhuai Ren (Peking University): Deep Learning; Natural Language Processing
Sirui Deng (LLM-Core Xiaomi)
Tao Guo (LLM-Core Xiaomi)
Tianyang Lu (LLM-Core Xiaomi)
Weiji Zhuang (LLM-Core Xiaomi)
Weikang Zhang (LLM-Core Xiaomi)
Weimin Xiong (Peking University): Computer Science
Wenshan Huang (LLM-Core Xiaomi)
Wenyu Yang (LLM-Core Xiaomi)
Xin Zhang (LLM-Core Xiaomi)
Xing Yong (LLM-Core Xiaomi)
Xu Wang (LLM-Core Xiaomi)
Xueyang Xie (LLM-Core Xiaomi)
Yilin Jiang (LLM-Core Xiaomi)
Yixin Yang (LLM-Core Xiaomi)
Yongzhe He (LLM-Core Xiaomi)
Yu Tu (LLM-Core Xiaomi)
Yuanliang Dong (LLM-Core Xiaomi)
Yuchen Liu (LLM-Core Xiaomi)
Yue Ma (Bytedance): NLP; Dialogue System; LLM
Yue Yu (LLM-Core Xiaomi)
Yuxing Xiang (LLM-Core Xiaomi)
Zhaojun Huang (LLM-Core Xiaomi)
Zhenru Lin (Tsinghua University): Natural Language Processing
Zhipeng Xu (Northeastern University): NLP; Information Retrieval
Zhiyang Chen (LLM-Core Xiaomi)
Zhonghua Deng (LLM-Core Xiaomi)
Zihan Zhang (LLM-Core Xiaomi)
Zihao Yue (Renmin University of China): Multimodal AI; Language Modeling