MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large reasoning models scale test-time compute inefficiently on long-context and complex reasoning tasks. Method: We introduce the first open-weight, large-scale hybrid-attention reasoning model: a Mixture-of-Experts (MoE) architecture with 456B total parameters (45.9B activated per token) and native 1M-token context support, combining lightning attention with standard softmax attention. We also propose CISPO, a reinforcement learning (RL) algorithm that clips importance sampling weights rather than token updates, significantly accelerating RL training; full-scale RL completed in just three weeks on 512 H800 GPUs at a rental cost of about $535K. Contribution/Results: Our model matches or outperforms strong open-weight baselines, including DeepSeek-R1 and Qwen3-235B, on software engineering, long-document understanding, and tool-use benchmarks, demonstrating the effectiveness of jointly pursuing efficient test-time compute scaling and large-scale MoE RL training.
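
As a sketch of what "clips importance sampling weights rather than token updates" means in practice: PPO-style objectives take min(r·A, clip(r)·A), which zeroes the gradient for any token whose ratio gets clipped; the CISPO idea instead clamps the IS ratio, detaches it, and keeps a REINFORCE-style gradient flowing through every token. The PyTorch function below is a minimal sketch under that reading; the function name, tensor shapes, and epsilon defaults are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch of the CISPO idea: clip the importance sampling (IS) weight and
# use it as a detached coefficient on a policy-gradient term, rather than
# PPO-style clipping that zeroes gradients for clipped tokens.
import torch

def cispo_loss(logp_new, logp_old, advantages, mask,
               eps_low=1.0, eps_high=0.2):
    """logp_new:   (B, T) log-probs of sampled tokens under the current policy
       logp_old:   (B, T) log-probs under the behavior (sampling) policy
       advantages: (B, T) advantage estimates (broadcast per sequence if needed)
       mask:       (B, T) 1.0 for response tokens, 0.0 for padding
       eps_low/eps_high: assumed clipping bounds, not the paper's values"""
    ratio = torch.exp(logp_new - logp_old.detach())           # token-level IS weight
    # Clip the IS weight itself; detach so it acts as a fixed coefficient.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high).detach()
    # Every token keeps a gradient through logp_new, unlike PPO's clipped min().
    per_token = clipped * advantages * logp_new
    return -(per_token * mask).sum() / mask.sum()
```

Usage would be standard off-policy RL fine-tuning: record `logp_old` at sampling time, recompute `logp_new` under the current policy, and backpropagate the returned loss.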

📝 Abstract
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
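
The abstract's efficiency claim comes from lightning attention being, at its core, a block-wise form of linear attention: past context is folded into a fixed-size KV state, so per-token decoding cost does not grow with context length. Below is a minimal plain-PyTorch sketch of block-wise causal linear attention; it shows the intra-block/inter-block split but omits the fused, I/O-aware kernels and any decay or normalization terms of the actual mechanism, and all shapes and the block size are illustrative assumptions.

```python
# Minimal reference sketch of block-wise causal linear attention (the math
# behind lightning-attention-style kernels, without the fused tiling that
# makes them fast in practice).
import torch

def linear_attention_blockwise(q, k, v, block=64):
    """q, k, v: (B, H, T, D). Causal linear attention:
       out_t = q_t @ sum_{s <= t} k_s^T v_s   (no softmax)."""
    B, H, T, D = q.shape
    kv_state = torch.zeros(B, H, D, v.shape[-1], dtype=q.dtype, device=q.device)
    out = torch.zeros_like(v)
    for start in range(0, T, block):
        end = min(start + block, T)
        qb, kb, vb = q[:, :, start:end], k[:, :, start:end], v[:, :, start:end]
        # Inter-block: contribution of all earlier blocks via the running KV state.
        inter = qb @ kv_state
        # Intra-block: causally masked attention within the block.
        scores = qb @ kb.transpose(-1, -2)                      # (B, H, b, b)
        causal = torch.tril(torch.ones(end - start, end - start,
                                       device=q.device, dtype=q.dtype))
        intra = (scores * causal) @ vb
        out[:, :, start:end] = inter + intra
        kv_state = kv_state + kb.transpose(-1, -2) @ vb         # fold block into state
    return out
```

The running `kv_state` is why test-time compute scales well here: a 1M-token prefix costs the same per generated token as a 1K-token prefix, whereas softmax attention's per-token cost grows with the KV cache.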
Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling test-time compute with lightning attention
Processing long inputs for complex tasks effectively
Enhancing RL training efficiency with hybrid-attention and CISPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mixture-of-Experts with lightning attention (see the cost sketch after this list)
Native support for 1M token context
CISPO algorithm for efficient RL training
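
To make the efficiency bullet concrete, here is a back-of-the-envelope comparison under assumed constants (head dimension d = 128, counting only attention arithmetic for one head and layer): softmax attention does on the order of T²·d work over a length-T generation, while a fixed-size recurrent state makes linear attention roughly T·d².

```python
# Schematic cost comparison, not measured FLOPs: assumes head dim d = 128 and
# counts only attention score/state arithmetic for a single head and layer.
d = 128
for T in (100_000, 1_000_000):
    softmax_ops = T * T * d   # pairwise scores: quadratic in generation length
    linear_ops = T * d * d    # fixed-size KV state: constant work per token
    print(f"T={T:>9,}: softmax ~{softmax_ops:.2e} vs linear ~{linear_ops:.2e} ops")
```

At T = 1M the schematic ratio is T/d ≈ 7,800x, which is why 1M-token contexts and 40K-80K-token thinking budgets become practical to serve.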
👥 Authors
MiniMax
Aili Chen
Aonian Li
Bangwei Gong
Binyang Jiang
Bo Fei
Bo Yang
Boji Shan
Changqing Yu
Chao Wang
Cheng Zhu (J. Erskine Love Jr. Endowed Chair in Engineering and Regents' Professor; Biomechanics, Mechanobiology, Immunology, Cancer, Hemostasis and Thrombosis)
Chengjun Xiao
Chengyu Du (Fudan University; LLM, Agent, RL)
Chi Zhang
Chu Qiao
Chunhao Zhang
Chunhui Du
Congchao Guo
Da Chen
Deming Ding
Dianjun Sun
Dong Li
Enwei Jiao
Haigang Zhou
Haimo Zhang
Han Ding
Haohai Sun
Haoyu Feng
Huaiguang Cai (Institute of Automation, Chinese Academy of Sciences; Computational Game Theory, Online Learning, LLM)
Haichao Zhu
Jian Sun
Jiaqi Zhuang
Jiaren Cai
Jiayuan Song
Jin Zhu
Jingyang Li (PhD student, National University of Singapore; Optimization, Deep Learning)
Jinhao Tian
Jinli Liu
Junhao Xu
Junjie Yan
Junteng Liu (Hong Kong University of Science and Technology; Machine Learning, Natural Language Processing)
Junxian He (Hong Kong University of Science and Technology; Machine Learning, Natural Language Processing)
Kaiyi Feng
Ke Yang
Kecheng Xiao
Le Han
Leyang Wang (University College London; Machine Learning, Statistics)
Lianfei Yu
Liheng Feng
Lin Li
Lin Zheng
Linge Du
Lingyu Yang
Lunbin Zeng (Huazhong University of Science and Technology; Computer Vision)
Ming-Yuan Yu
Mingliang Tao
Mingyuan Chi
Mozhi Zhang (ByteDance Seed; Large Language Model, Natural Language Processing)
Mujie Lin
Nan Hu
Nongyu Di
Peng Gao
Pengfei Li
Pengyu Zhao (Peking University; Neural Architecture Search, Recommender System, 360-degree Video)
Qibing Ren (Shanghai Jiao Tong University; Machine Learning, Computer Vision, Trustworthy AI)
Qile Li
Qin Wang (ETH Zurich; Domain Adaptation, Computer Vision)
Rong Tian (Harbin Institute of Technology; Natural Language Processing, Machine Learning, Information Retrieval, Accelerated Computing)
Ruitao Leng
Shaoxiang Chen
Shaoyu Chen
Shengmin Shi
Shitong Weng
Shuchang Guan
Shuqi Yu
Sichen Li
Songquan Zhu
Tengfei Li
Tianchi Cai (LLM Alignment, MiniMax; LLM, Alignment, RL, Budget Allocation)
Tianrun Liang
Weiyu Cheng
Weize Kong (OpenAI; Large Language Models, Information Retrieval)
Wenkai Li
Xiancai Chen
Xiangjun Song
Xiao Luo
Xiao Su
Xiaobo Li
Xiaodong Han
Xinzhu Hou
Xuan Lu (Assistant Professor, University of Arizona; Human-centered Data Science, Human-AI Collaboration, Causal Inference, Future of Work, Emoji)
Xun Zou
Xuyang Shen (MiniMax | ANU; Multimodal Machine Learning)
Yan Gong
Yan Ma
Yang Wang
Yiqi Shi
Yiran Zhong (PhD, Australian National University; LLM, Self-supervised Learning, Visual Geometry Learning, Natural Language Processing, Multimodal)
Yonghong Duan
Yongxiang Fu
Yongyi Hu
Yu Gao
Yuanxiang Fan (Osaka University; Reinforcement Learning, Machine Learning, NLP)
Yufeng Yang
Yuhao Li
Yulin Hu
Yunan Huang
Yunji Li
Yunzhi Xu
Yuxin Mao
Yuxuan Shi
Yuze Wenren
Zehan Li (PhD, UTHealth Houston; AI for Mental Health, Psychiatry, Biomedical Informatics, LLMs, Clinical Phenotyping)
Zelin Li
Zhanxu Tian
Zhen-kun Zhu
Zhenhua Fan
Zhenzhen Wu
Zhichao Xu (Amazon AWS, University of Utah; Natural Language Processing, Information Retrieval)
Zhihang Yu
Zhiheng Lyu
Zhuo Jiang
Zi Gao
Zijia Wu
Zijian Song
Zijun Sun