🤖 AI Summary
This work proposes a 309B-parameter sparse mixture-of-experts (MoE) language model that activates only 15B parameters per token, designed to deliver fast, strong reasoning and agentic-task performance while reducing computational cost. The architecture interleaves sliding-window and global attention, and the model is trained with a multi-token prediction (MTP) objective alongside a multi-teacher on-policy distillation (MOPD) paradigm that efficiently scales post-training compute. Despite using only one-half to one-third of the total parameters of leading open-weight models of similar capability, the proposed model achieves comparable or superior performance; by reusing its MTP layers as a draft model for speculative decoding, it reaches an average acceptance length of up to 3.6 tokens and accelerates decoding by up to 2.6×.
📝 Abstract
We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, using a 128-token sliding window at a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP) at a native 32k context length, which is subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm, in which domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense, token-level rewards, enabling the student model to faithfully absorb each teacher's expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves an average acceptance length of up to 3.6 tokens and a 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
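The 5:1 hybrid attention layout described in the abstract can be sketched as follows. This is a minimal illustration of the general pattern (a causal sliding-window mask interleaved with full causal attention), not the released implementation; the layer count, function names, and the exact placement of global layers within each group of six are assumptions.

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean causal mask; if `window` is set, each query may only
    attend to the most recent `window` keys (sliding-window attention)."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if window is None:
        return causal                 # global (full causal) attention
    return causal & (i - j < window)  # SWA: keep keys inside the window

def layer_pattern(num_layers, ratio=5):
    """Interleave SWA and global layers at a `ratio`:1 hybrid ratio,
    e.g. five SWA layers followed by one global layer, repeated."""
    return ["global" if (k + 1) % (ratio + 1) == 0 else "swa"
            for k in range(num_layers)]

pattern = layer_pattern(12)                    # 5 SWA, 1 global, repeated
mask_swa = attention_mask(256, window=128)     # 128-token sliding window
mask_global = attention_mask(256)              # full causal attention
```

The sliding-window layers keep the per-token key-value cache bounded by the 128-token window, while the periodic global layers preserve long-range information flow across the full (up to 256k) context.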