🤖 AI Summary
To address the challenge of jointly optimizing local feature extraction and global contextual modeling in visual representation learning, this paper proposes VCMamba, a computationally efficient vision backbone that integrates convolutional inductive bias with multi-directional state-space modeling. Its stage-wise hybrid architecture employs a convolutional stem and convolutional blocks in shallow layers for localized feature extraction, and introduces multi-directional Mamba blocks in deeper layers to capture long-range dependencies while maintaining linear computational complexity. Crucially, VCMamba combines CNNs' locality with Mamba's directional, global modeling capability. Evaluated on ImageNet-1K, the VCMamba-B variant achieves 82.6% top-1 accuracy, surpassing several state-of-the-art models while using 37% fewer parameters than PlainMamba-L3. On ADE20K semantic segmentation, it attains 47.1 mIoU, further demonstrating strong generalization. These results support VCMamba's effectiveness in balancing expressiveness, efficiency, and scalability across diverse vision tasks.
📝 Abstract
Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet neither captures fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce VCMamba, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. The resulting feature maps are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba's effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.
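To make the multi-directional scanning idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation): a 2D feature map is flattened into a sequence along several scan orders (row-major, reversed, column-major, reversed), each sequence is processed by a linear-time recurrence, and the results are merged. The true Mamba block uses a selective state-space model; a toy scalar recurrence h_t = a*h_{t-1} + x_t stands in for it here, and all function names are illustrative.

```python
def directional_orders(h, w):
    """Return four scan orders over an h x w grid, as lists of flat indices."""
    fwd = [r * w + c for r in range(h) for c in range(w)]   # row-major
    col = [r * w + c for c in range(w) for r in range(h)]   # column-major
    return [fwd, fwd[::-1], col, col[::-1]]

def toy_ssm_scan(x, order, a=0.5):
    """Linear-time recurrence h_t = a*h_{t-1} + x_t along one scan order.

    Stand-in for a selective SSM: each step costs O(1), so a full scan
    is O(h*w), unlike attention's quadratic cost in sequence length.
    """
    out = [0.0] * len(x)
    h = 0.0
    for idx in order:
        h = a * h + x[idx]
        out[idx] = h
    return out

def multi_directional_block(x, h, w, a=0.5):
    """Average the four directional scans so every position can receive
    information from every other position, still in linear time overall."""
    orders = directional_orders(h, w)
    scans = [toy_ssm_scan(x, o, a) for o in orders]
    return [sum(vals) / len(orders) for vals in zip(*scans)]

# Toy usage: a single activated pixel on a 2x2 grid propagates, with
# decaying weight, to all positions after one multi-directional block.
out = multi_directional_block([1.0, 0.0, 0.0, 0.0], 2, 2)
```

Because a one-directional scan only lets information flow "forward" along the flattened sequence, combining opposing and transposed scan orders is what gives each spatial location access to global context, which is the motivation for the multi-directional design described above.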