Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the performance degradation that often arises when integrating Mixture-of-Experts (MoE) into State Space Models (SSMs), this paper proposes Routing Mamba (RoM), a sparse MoE extension tailored to the Mamba architecture. RoM shares routing decisions across the linear projection layers and lightweight submodules within a Mamba layer, fostering expert collaboration while balancing modeling capacity against computational cost. Building on Mamba's input-dependent gating, hardware-aware implementation, and linear SSM structure, RoM maintains consistent perplexity across long context lengths. Experiments show that, while activating only 1.3B parameters (10B total), RoM matches the language modeling performance of a dense Mamba model requiring over 2.3x more active parameters, and in hybrid models it saves 23% of FLOPS relative to dense scaling at similar quality. This substantially improves parameter scaling efficiency, establishing a new trade-off frontier between model size, computation, and accuracy.

📝 Abstract
Linear State Space Models (SSMs) offer remarkable performance gains in efficient sequence modeling, with constant inference-time computation and memory complexity. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations, positioning them as strong alternatives to Transformers for long sequence modeling. However, efficiently scaling the expressive power of SSMs, particularly with Mixture of Experts (MoE), remains challenging, as naive integration attempts often falter or degrade performance. In this work, we introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts. By sharing routing decisions between projection layers and lightweight sub-modules within Mamba across experts, RoM leverages synergies among linear projection experts for effective and efficient sparse scaling of Mamba layers. At a scale of 1.3B active parameters (10B total) and 16K training sequence length, RoM achieves language modeling performance equivalent to a dense Mamba model requiring over 2.3x more active parameters, and demonstrates consistent perplexity across context lengths. Experimental results further show RoM effectively scales hybrid language models, yielding a 23% FLOPS saving compared to dense Mamba scaling for similar performance.
Problem

Research questions and friction points this paper is trying to address.

Scaling State Space Models with Mixture-of-Experts efficiently
Improving performance of SSMs without increasing active parameters
Enhancing sparse scaling of Mamba layers for long sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales SSMs with sparse mixture-of-experts projection
Shares routing decisions across projection layers
Achieves efficient sparse scaling of Mamba layers
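The core idea above, one router whose top-k decision is shared by all projection layers inside a Mamba block, so each expert's projections are always activated together, can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the tiny dimensions, the two-projection expert, and the softmax gating are all assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_inner = 8, 16   # hypothetical toy dimensions
n_experts, top_k = 4, 1    # sparse: each token activates top_k experts

# One expert = a bundle of linear projections (standing in for the
# input/output projections of a Mamba block); two matrices per expert here.
W_in = rng.standard_normal((n_experts, d_model, d_inner)) * 0.1
W_out = rng.standard_normal((n_experts, d_inner, d_model)) * 0.1

# A single router shared by all projection layers, so one routing
# decision selects a coherent set of projections per token.
W_router = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rom_projection(x):
    """x: (tokens, d_model) -> (tokens, d_model), top-k sparse MoE."""
    gates = softmax(x @ W_router)                 # (tokens, n_experts)
    top = np.argsort(-gates, axis=-1)[:, :top_k]  # shared expert choice
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            h = x[t] @ W_in[e]                    # same expert index e is
            y[t] += gates[t, e] * (h @ W_out[e])  # reused for both projections
    return y

x = rng.standard_normal((5, d_model))
print(rom_projection(x).shape)  # (5, 8)
```

Because only `top_k` of the `n_experts` projection bundles run per token, active parameters and FLOPS stay near the dense single-expert cost while total capacity scales with `n_experts`, which is the trade-off the paper reports.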