Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

📅 2025-02-18
🤖 AI Summary
To address the high computational complexity, growing KV-cache overhead, and reliance on separate vision encoders that hinder multimodal large language model (MLLM) deployment, this paper introduces mmMamba, the first native linear-complexity multimodal state space model. A three-stage progressive knowledge distillation framework transfers trained decoder-only MLLMs (e.g., HoVLE) end-to-end into the Mamba architecture. The authors further introduce a seeding strategy that carves Mamba parameters from the trained Transformer, plus a configurable Transformer-Mamba hybrid layer design that flexibly trades accuracy against efficiency. Crucially, mmMamba requires neither a pre-trained vision encoder nor a pre-trained RNN-based LLM, significantly lowering deployment barriers. Experiments show that mmMamba-linear achieves a 20.6× speedup and 75.8% GPU memory reduction over HoVLE at 103K tokens, while mmMamba-hybrid approaches HoVLE's performance while delivering a 13.5× speedup and 60.2% memory savings. Code and models are publicly released.
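The three-stage recipe above can be sketched as a stage-to-configuration map. This is a minimal illustration assuming a common progressive-distillation pattern (layerwise hidden-state matching first, end-to-end logit distillation last); the stage names, trainable-parameter groups, and loss labels here are illustrative assumptions, not the paper's exact settings.

```python
def distillation_stage_config(stage: int) -> dict:
    """Return which parameters are unfrozen and which loss drives each
    distillation stage (illustrative sketch, not the paper's exact recipe)."""
    if stage == 1:
        # Train only the newly seeded Mamba (SSM) parameters, matching each
        # teacher layer's hidden states layer by layer.
        return {"trainable": "ssm_only", "loss": "layerwise_hidden_mse"}
    if stage == 2:
        # Unfreeze the full student layers, still supervised layer by layer.
        return {"trainable": "all_layer_params", "loss": "layerwise_hidden_mse"}
    if stage == 3:
        # Distill end-to-end against the teacher's output distribution.
        return {"trainable": "all_params", "loss": "logit_kl_divergence"}
    raise ValueError(f"unknown stage: {stage}")
```

Each stage widens the set of trainable parameters while moving the supervision signal from intermediate activations toward the final output distribution.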

📝 Abstract
Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value (KV) cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs into linear-complexity architectures without requiring pre-trained RNN-based LLMs or vision encoders. We propose a seeding strategy to carve Mamba from the trained Transformer and a three-stage distillation recipe, which effectively transfers knowledge from the Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear- and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates a 20.6× speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5× speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba.
Problem

Research questions and friction points this paper is trying to address.

Quadratic computational complexity of attention over long multimodal sequences.
Growing Key-Value cache memory during inference.
Reliance on separate pre-trained vision encoders.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear-complexity multimodal state space model
Progressive distillation from MLLMs
Hybrid Transformer-Mamba architectures
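The hybrid-architecture idea can be sketched as a layer schedule: given a stack depth and a budget of Transformer layers to retain, place those layers among the Mamba layers. How the retained layers are positioned is a design choice the paper leaves configurable; spreading them uniformly, as below, is an illustrative assumption.

```python
def hybrid_layer_schedule(num_layers: int, num_transformer: int) -> list:
    """Build a per-layer type list mixing Transformer and Mamba layers.

    Illustrative sketch: retained Transformer layers are spread uniformly
    across the stack; the actual placement is a configurable design choice.
    """
    if not 0 <= num_transformer <= num_layers:
        raise ValueError("num_transformer must be in [0, num_layers]")
    if num_transformer == 0:
        # Fully linear-complexity stack (the mmMamba-linear setting).
        return ["mamba"] * num_layers
    step = num_layers / num_transformer
    keep = {int(i * step) for i in range(num_transformer)}
    return ["transformer" if i in keep else "mamba" for i in range(num_layers)]
```

Setting `num_transformer = 0` corresponds to the fully linear model, while larger values trade inference efficiency back for accuracy, as in the reported hybrid variant.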
👥 Authors
Bencheng Liao
Institute of Artificial Intelligence, Huazhong University of Science & Technology
Hongyuan Tao
School of EIC, Huazhong University of Science & Technology
Qian Zhang
Horizon Robotics
Tianheng Cheng
ByteDance Seed
Computer Vision · Object Detection · Instance Segmentation · Multimodal Models · Autonomous Driving
Yingyue Li
School of EIC, Huazhong University of Science & Technology
Haoran Yin
Leiden University
Wenyu Liu
School of EIC, Huazhong University of Science & Technology
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial Intelligence · Computer Vision · Autonomous Driving · Object Detection · Object Segmentation