Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

📅 2024-03-21
🏛️ arXiv.org
📈 Citations: 46
Influential: 2
🤖 AI Summary
Existing multimodal large language models (MLLMs) suffer from quadratic computational complexity inherent to Transformer architectures, resulting in inefficient cross-modal understanding and slow inference. To address this, we propose the first Mamba-based MLLM, pioneering the integration of linear-complexity state space models (SSMs) into the visual modality. Our approach introduces a lightweight visual encoder and a serialized cross-modal fusion mechanism that enables efficient feature alignment and dynamic inter-modal information exchange. Despite using only 43% of LLaVA’s parameter count, our model matches its performance on standard benchmarks and significantly outperforms it on closed-set visual reasoning tasks—particularly visual illusion recognition and spatial relation reasoning. Moreover, it achieves substantially faster inference, demonstrating strong competitiveness against lightweight state-of-the-art models such as TinyLLaVA and MobileVLM v2.
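To make the serialized fusion idea concrete, below is a minimal sketch of how image tokens can be projected into the language model's embedding space and prepended to text embeddings before a single Mamba pass. The module names, the two-layer MLP projector, and the simple prepend-style fusion here are illustrative assumptions for this sketch, not Cobra's exact design.

```python
import torch
import torch.nn as nn

class MambaVLMSketch(nn.Module):
    """Minimal sketch of a Mamba-style multimodal LLM.

    `vision_encoder` and `mamba_lm` are placeholder modules supplied by the
    caller; Cobra's actual encoders and fusion scheme are described in the
    paper. This only illustrates the image-token -> projection ->
    concatenation -> linear-time language-model flow.
    """

    def __init__(self, vision_encoder: nn.Module, mamba_lm: nn.Module,
                 vis_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a ViT-style backbone
        self.projector = nn.Sequential(               # align visual features to LM space
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.mamba_lm = mamba_lm                      # linear-complexity SSM language model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into a sequence of patch features: (B, N_img, vis_dim)
        vis_feats = self.vision_encoder(images)
        # Project patch features into the LM embedding space: (B, N_img, lm_dim)
        vis_tokens = self.projector(vis_feats)
        # Serialize the modalities: prepend visual tokens to the text embeddings
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        # One left-to-right state-space scan over the fused sequence
        return self.mamba_lm(inputs)
```

Because the fused sequence is consumed by a single recurrent scan rather than pairwise attention, inference cost grows linearly with the combined number of visual and text tokens.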

📝 Abstract
In recent years, the application of multimodal large language models (MLLMs) in various fields has achieved remarkable success. However, as the foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, which has less efficient quadratic computational complexity. To improve the efficiency of such base models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model into the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multi-modal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance against current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster owing to its linear sequential modeling; (2) interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships; (3) notably, Cobra even achieves performance comparable to LLaVA with about 43% of the parameters. We will open-source all of Cobra's code and hope the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.
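For readers unfamiliar with why a Mamba backbone changes the complexity class, the contrast can be summarized with the generic discretized state-space recurrence that Mamba-style models build on, versus softmax attention. This is standard background, not a Cobra-specific formula: the recurrence performs one bounded-size state update per token, so a length-L sequence costs O(L), whereas attention materializes all pairwise token scores and costs O(L^2).

```latex
% Discretized SSM recurrence: one state update per token => O(L) over a length-L sequence
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% Softmax attention: an L x L score matrix => O(L^2)
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d}}\right) V
```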
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Transformer Architecture
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient Multi-modal Processing
Enhanced Image Understanding
Linear-Complexity Mamba Backbone
👥 Authors
Han Zhao, Westlake University
Min Zhang, Zhejiang University
Wei Zhao, Westlake University
Pengxiang Ding, Zhejiang University (Human Motion Prediction, Large Language Model, Embodied AI)
Siteng Huang, Alibaba DAMO Academy | ZJU | Westlake University (Vision-language Models, Generative Models, Embodied AI)
Donglin Wang, Westlake University