MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

📅 2025-06-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Complex tasks that require processing long inputs and reasoning extensively demand efficient scaling of test-time compute. Method: MiniMax-M1 is presented as the first open-weight, large-scale hybrid-attention reasoning model. It uses a hybrid Mixture-of-Experts (MoE) architecture with 456B total parameters, of which 45.9B are activated per token, combined with a lightning attention mechanism, and it natively supports a 1M-token context. The model is trained with large-scale reinforcement learning (RL), and the authors propose CISPO, an RL algorithm that clips importance sampling weights rather than token updates to improve RL efficiency. Full RL training completed in three weeks on 512 H800 GPUs at a rental cost of $534,700. Contribution/Results: On standard benchmarks, the models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks.

📝 Abstract

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
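
The abstract's one-line description of CISPO ("clips importance sampling weights rather than token updates") can be illustrated with a small per-token comparison. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes a policy-gradient surrogate in which the clipped importance weight acts as a constant coefficient on the token's log-probability, and it contrasts that with standard PPO-style clipping. The function names, the `eps` values, and the exact form of the surrogate are all hypothetical choices made for this example.

```python
import math


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def ppo_token_grad_coeff(logp_new, logp_old, advantage, eps=0.2):
    """Gradient of PPO's min(r*A, clip(r)*A) w.r.t. logp_new for one token.

    When the clipped branch is the active minimum, the token contributes
    no gradient at all (its update is dropped).
    """
    r = math.exp(logp_new - logp_old)
    unclipped = r * advantage
    clipped = clip(r, 1 - eps, 1 + eps) * advantage
    return r * advantage if unclipped <= clipped else 0.0


def clipped_weight_token_grad_coeff(logp_new, logp_old, advantage,
                                    eps_low=0.2, eps_high=0.2):
    """Assumed CISPO-style coefficient: clip(r) * A, with clip(r) held constant.

    The importance weight is bounded, but the token still contributes
    a gradient through its log-probability.
    """
    r = math.exp(logp_new - logp_old)
    weight = clip(r, 1 - eps_low, 1 + eps_high)  # treated as a constant
    return weight * advantage


# A token whose probability rose by 1.5x under the new policy, with A = +1.
logp_old, logp_new, adv = 0.0, math.log(1.5), 1.0
print(ppo_token_grad_coeff(logp_new, logp_old, adv))             # token dropped
print(clipped_weight_token_grad_coeff(logp_new, logp_old, adv))  # bounded weight
```

In this toy case the PPO-style rule removes the token from the update, while the clipped-weight rule keeps it with a bounded coefficient. That contrast is one plausible reading of the abstract's wording.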

Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling test-time compute with lightning attention

Processing long inputs for complex tasks effectively

Enhancing RL training efficiency with hybrid-attention and CISPO
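
As a back-of-the-envelope check on the RL training budget quoted in the abstract (512 H800 GPUs, three weeks, $534,700 rental cost), the implied price per GPU-hour can be computed directly. This assumes exactly 21 days of continuous use on all 512 GPUs, which the source does not state, so the result is only a rough estimate. The helper name is hypothetical.

```python
def implied_gpu_hour_rate(total_cost_usd, num_gpus, days):
    """Return (GPU-hours, USD per GPU-hour), assuming all GPUs run 24 h/day."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours, total_cost_usd / gpu_hours


# Assumption: "three weeks" means exactly 21 days of continuous use.
gpu_hours, rate = implied_gpu_hour_rate(534_700, 512, 21)
print(f"{gpu_hours} GPU-hours, about ${rate:.2f} per GPU-hour")
# prints: 258048 GPU-hours, about $2.07 per GPU-hour
```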

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mixture-of-Experts with lightning attention

Native support for 1M token context

CISPO algorithm for efficient RL training
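
The summary does not describe how lightning attention works internally. For intuition about why linear-attention-style mechanisms help with very long contexts, the sketch below shows a generic causal linear attention (no softmax, no normalization) computed in two ways: as a recurrence over a fixed-size state, and as the equivalent quadratic-time sum. This is a textbook-style illustration under that assumption, not MiniMax's kernel. The function names and toy inputs are hypothetical.

```python
def causal_linear_attention(queries, keys, values):
    """Recurrent form: S_t = S_{t-1} + outer(k_t, v_t); o_t = q_t @ S_t.

    Per-token cost depends on the head dimensions, not on sequence length.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def causal_linear_attention_quadratic(queries, keys, values):
    """Reference form: o_t = sum over s <= t of (q_t . k_s) * v_s."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, keys[s]))
            for j, v_j in enumerate(values[s]):
                out[j] += score * v_j
        outputs.append(out)
    return outputs


# Toy inputs: 3 tokens, key/query dimension 2, value dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(causal_linear_attention(q, k, v))
print(causal_linear_attention_quadratic(q, k, v))
```

Both functions produce the same outputs. The recurrent form only needs the running state, so its memory use does not grow with the number of tokens already processed.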

🔎 Similar Papers

Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow