GroupMamba: Efficient Group-Based Visual State Space Model

📅 2024-07-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability and limited computational efficiency of vision state-space models (SSMs) when scaled to large sizes, this paper proposes GroupMamba, built around a Modulated Group Mamba layer: the input channels are partitioned into four groups, and an efficient SSM-based Visual Single Selective Scanning (VSSS) block is applied to each group in parallel, with each block scanning along one of four spatial directions. A channel modulation operator improves cross-group communication, and a distillation-based training objective stabilizes the training of large models. Together, these components improve the scalability and robustness of SSMs for vision tasks. The tiny variant (23M parameters) achieves 83.3% top-1 accuracy on ImageNet-1K while being 26% more parameter-efficient than the best prior Mamba design of comparable size, and it consistently outperforms existing SSM and Transformer baselines on MS-COCO object detection/instance segmentation and ADE20K semantic segmentation.
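The grouped four-direction scan and channel modulation described above can be sketched in NumPy as follows. This is an illustrative simplification, not the authors' code: the SSM selective scan is replaced by a causal cumulative average in each direction, and the channel modulation operator is approximated by a simple sigmoid gate computed from globally pooled features.

```python
import numpy as np

def directional_scan(x, direction):
    """Placeholder for the VSSS selective scan: a causal cumulative
    average along one spatial direction (illustrative stand-in only).
    x has shape (H, W, C_group)."""
    if direction == "left_to_right":
        y = np.cumsum(x, axis=1) / np.arange(1, x.shape[1] + 1)[None, :, None]
    elif direction == "right_to_left":
        y = np.cumsum(x[:, ::-1], axis=1)[:, ::-1]
        y = y / np.arange(x.shape[1], 0, -1)[None, :, None]
    elif direction == "top_to_bottom":
        y = np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None, None]
    else:  # bottom_to_top
        y = np.cumsum(x[::-1], axis=0)[::-1]
        y = y / np.arange(x.shape[0], 0, -1)[:, None, None]
    return y

def grouped_modulated_layer(x):
    """Split channels into 4 groups, scan each group in one of four
    spatial directions, then reweight channels. The sigmoid gate on the
    globally averaged features is an assumed, simplified form of the
    paper's learned channel modulation operator."""
    H, W, C = x.shape
    assert C % 4 == 0, "channels must split evenly into 4 groups"
    groups = np.split(x, 4, axis=-1)
    dirs = ["left_to_right", "right_to_left", "top_to_bottom", "bottom_to_top"]
    scanned = np.concatenate(
        [directional_scan(g, d) for g, d in zip(groups, dirs)], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-scanned.mean(axis=(0, 1))))  # per-channel weight
    return scanned * gate[None, None, :]
```

Because each group scans only one direction, each VSSS block handles a quarter of the channels, which is where the layer's parameter efficiency comes from; the gate restores cross-group mixing that the independent scans would otherwise lack.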

📝 Abstract
State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. We introduce a parameter-efficient Modulated Group Mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks into a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection and instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more parameter-efficient than the best existing Mamba design of the same model size. Code and models are available at: https://github.com/Amshaker/GroupMamba.
Problem

Research questions and friction points this paper is trying to address.

Addresses instability in SSM-based vision models
Improves efficiency of large-scale SSM models
Enhances cross-channel communication in group processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modulated Group Mamba layer for parameter efficiency
Visual Single Selective Scanning (VSSS) block with four-directional scanning
Distillation-based training objective for stable large-model training
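The distillation objective listed above can be sketched as a standard Hinton-style soft-label loss, blending hard-label cross-entropy with a temperature-softened KL term against a teacher. The paper's exact formulation may differ; the temperature `T` and weight `alpha` here are illustrative defaults, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with optional temperature."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic soft-label distillation sketch: cross-entropy on hard
    labels plus a temperature-scaled soft-target term from the teacher.
    The T*T factor keeps the soft term's gradient scale comparable
    across temperatures."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * (T * T)
    hard = -np.log(
        softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12
    ).mean()
    return alpha * hard + (1 - alpha) * soft
```

Intuitively, the teacher's soft targets act as a regularizer on the student's logits, which is the mechanism the paper leverages to keep large-model training stable.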