MambaVision: A Hybrid Mamba-Transformer Vision Backbone

📅 2024-07-10
🏛️ arXiv.org
📈 Citations: 38
✨ Influential: 5
📄 PDF
🤖 AI Summary
To address the challenge of simultaneously achieving computational efficiency and effective long-range spatial dependency modeling in vision backbones, this paper introduces MambaVision, the first hybrid Mamba-Transformer architecture designed specifically for vision tasks. Methodologically, it features: (1) a redesigned, vision-optimized Mamba block adapted to 2D feature structures; (2) a hierarchical hybrid architecture that employs linear-complexity Mamba blocks in the earlier layers to capture local patterns and self-attention blocks in the final layers to strengthen global modeling; and (3) an end-to-end supervised training paradigm. Evaluated on ImageNet-1K, MambaVision achieves state-of-the-art Top-1 accuracy (83.7%) with high image throughput. It also outperforms comparably sized ViT- and CNN-based backbones on MS COCO object detection and instance segmentation and on ADE20K semantic segmentation, demonstrating the effectiveness and strong generalization of the Mamba-ViT fusion paradigm.
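The hybrid layout described above (linear-complexity Mamba mixers early, self-attention concentrated in the final layers of the later stages) can be sketched as a per-stage block schedule. This is a minimal illustrative sketch: the function name, the half-and-half split, and the stage depths are assumptions for exposition, not the paper's actual configuration or API.

```python
def stage_schedule(depth, use_attention):
    """Return the ordered block types for one stage.

    Hypothetical sketch: in attention-equipped stages, the final half
    of the blocks are self-attention and the earlier ones are
    Mamba-style mixers; other stages use Mamba blocks only.
    """
    if not use_attention:
        return ["mamba"] * depth
    n_attn = depth // 2  # self-attention kept to the final layers
    return ["mamba"] * (depth - n_attn) + ["attention"] * n_attn


# A 4-stage hierarchical layout (depths are illustrative): only the
# two deepest stages mix in self-attention for global modeling.
layout = [stage_schedule(d, use_attention=(i >= 2))
          for i, d in enumerate([1, 3, 8, 4])]
```

Confining self-attention to the last blocks of the deeper, lower-resolution stages is what keeps the quadratic attention cost bounded while still capturing long-range spatial dependencies.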

๐Ÿ“ Abstract
We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve new state-of-the-art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones and demonstrates more favorable performance. Code: https://github.com/NVlabs/MambaVision.
Problem

Research questions and friction points this paper is trying to address.

Hybrid Mamba-Transformer for vision applications
Enhancing Mamba for efficient visual feature modeling
Improving long-range spatial dependencies with self-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Mamba-Transformer for vision tasks
Enhanced Mamba with self-attention blocks
Hierarchical models for SOTA performance