M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the test-time scaling bottleneck that the Transformer's quadratic computational complexity imposes on long chain-of-thought mathematical reasoning, this paper proposes M1, the first hybrid linear RNN reasoning model built on the Mamba state-space architecture. Methodologically, it combines chain-of-thought distillation from existing reasoning models with PPO-based reinforcement learning, and exploits the resulting throughput advantage through self-consistency voting. Key contributions include: (i) the first demonstration of a linear-complexity RNN matching the accuracy of same-scale DeepSeek R1 distilled Transformer reasoning models on the AIME and MATH benchmarks; (ii) more than a 3× inference speedup over a same-size Transformer; and (iii) higher accuracy under a fixed generation-time budget via self-consistency. This work eases the computational and memory scaling barriers inherent in long-chain reasoning, establishing a more efficient paradigm for scaling test-time computation.

📝 Abstract
Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, Transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art DeepSeek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general-purpose inference engine, vLLM, and observe more than a 3× speedup compared to a same-size Transformer. With this throughput speedup, we are able to achieve higher accuracy than DeepSeek R1 distilled Transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain-of-thought reasoning.
Problem

Research questions and friction points this paper is trying to address.

Scalable test-time compute for complex math problems
Memory-efficient inference with hybrid linear RNN model
Faster generation speed compared to transformer models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid linear RNN model for memory-efficient inference
Distillation and RL training enhance performance
3× speedup over same-size Transformers; higher accuracy under a fixed time budget
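The self-consistency voting used to convert the throughput advantage into accuracy reduces, at its core, to a majority vote over final answers extracted from independently sampled reasoning chains: a faster model fits more chains into the same time budget, so it votes over a larger pool. A minimal sketch (the function name and sample answers are illustrative, not from the paper):

```python
from collections import Counter

def self_consistency_vote(answers):
    """Majority-vote over final answers from independently sampled
    reasoning chains; ties break toward the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers parsed from five sampled chains.
chains = ["42", "41", "42", "42", "13"]
print(self_consistency_vote(chains))  # → 42
```

Because each chain is sampled independently, the vote can be parallelized across chains, which is where a higher-throughput model gains the most under a fixed wall-clock budget.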