🤖 AI Summary
Transformers face scalability limitations in vision tasks due to the quadratic computational complexity of self-attention. This paper systematically surveys recent advances of the Mamba architecture—grounded in state space models (SSMs)—in computer vision, covering frameworks such as Vision Mamba (ViM) and VideoMamba that adapt SSMs to joint global-local modeling and spatiotemporal understanding. Key innovations include bidirectional and selective scanning mechanisms, cross-scan modules, hierarchical architectural designs, and optimized positional embeddings—collectively enhancing long-range dependency capture and fine-grained local feature perception. Results reported across the surveyed works indicate that this paradigm achieves linear time complexity while matching or outperforming attention-based methods in image classification, object detection, semantic segmentation, and video understanding—delivering gains in both efficiency and accuracy. The survey positions SSM-driven vision modeling as a compelling alternative to attention-based architectures.
📝 Abstract
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but the quadratic complexity of their attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state-space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates the Mamba architecture for visual-domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations such as position embeddings, cross-scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture for computer vision research and applications.
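To make the linear-scalability claim concrete, the core of an SSM layer is a recurrence whose cost grows linearly with sequence length, and ViM-style bidirectional scanning simply runs that recurrence in both directions over the token sequence. The sketch below is a minimal, hypothetical illustration with a diagonal state matrix and scalar tokens — not the actual Mamba implementation, which uses input-dependent (selective) parameters and a hardware-aware parallel scan:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time recurrence of a discretized diagonal SSM (sketch):
        h_t = A * h_{t-1} + B * x_t,   y_t = C . h_t
    Cost is O(L * N) for sequence length L and state size N,
    versus O(L^2) for full self-attention."""
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A * h + B * x_t   # elementwise state update (diagonal A)
        y[t] = C @ h          # linear readout of the hidden state
    return y

def bidirectional_scan(x, A, B, C):
    """Conceptual ViM-style bidirectional scanning: run the same
    recurrence forward and backward over the token sequence and
    merge the passes, so every position sees context from both sides."""
    fwd = ssm_scan(x, A, B, C)
    bwd = ssm_scan(x[::-1], A, B, C)[::-1]
    return fwd + bwd
```

For example, with a state size of 4 and decay `A = 0.5`, a constant input stream yields outputs that saturate geometrically, and the bidirectional variant returns one merged value per token. In real Mamba blocks, `A`, `B`, and `C` (and the step size) are functions of the input, which is what "selective" scanning refers to.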