🤖 AI Summary
To address the cost of global modeling on high-resolution images, where computation grows superlinearly with the number of pixels, this paper introduces VMamba, a family of vision backbones adapted from the Mamba state-space model. Its core component is the 2D Selective Scan (SS2D) module, which traverses the feature map along four scanning routes to gather global context at linear complexity, bridging the gap between 1D sequential scanning and the non-sequential 2D structure of images. Combined with a succession of architectural and implementation enhancements, VMamba reports promising performance on visual perception tasks including image classification, detection, and segmentation. Its main advantage over existing benchmark models is input scaling efficiency: time complexity grows linearly with input size, which makes it well suited to large images.
📝 Abstract
Designing computationally efficient network architectures remains an ongoing necessity in computer vision. In this paper, we adapt Mamba, a state-space language model, into VMamba, a vision backbone with linear time complexity. At the core of VMamba is a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D bridges the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the collection of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments demonstrate VMamba's promising performance across diverse visual perception tasks, highlighting its superior input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
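The four-route traversal described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the four routes are row-major, column-major, and their two reverses, and it replaces the learned, input-dependent selective scan with a fixed-decay recurrence (`toy_scan`). The function names `cross_scan`, `toy_scan`, `cross_merge`, and `ss2d_sketch` are hypothetical.

```python
import numpy as np


def cross_scan(x):
    """Unfold an (H, W, C) feature map into four (H*W, C) sequences."""
    h, w, c = x.shape
    row_major = x.reshape(h * w, c)                     # along rows, top to bottom
    col_major = x.transpose(1, 0, 2).reshape(h * w, c)  # along columns, left to right
    # Routes 3 and 4 are the reverses of routes 1 and 2 (an assumption here).
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def toy_scan(seq, decay=0.9):
    """Placeholder recurrence h_t = decay * h_{t-1} + x_t.

    Stands in for the selective scan, whose parameters would be learned
    and input-dependent. One pass over the sequence: O(H*W) steps.
    """
    out = np.empty_like(seq)
    state = np.zeros(seq.shape[1], dtype=seq.dtype)
    for t in range(seq.shape[0]):
        state = decay * state + seq[t]
        out[t] = state
    return out


def cross_merge(seqs, h, w):
    """Undo each route's ordering and sum the four results into (H, W, C)."""
    c = seqs.shape[-1]
    row_fwd = seqs[0].reshape(h, w, c)
    col_fwd = seqs[1].reshape(w, h, c).transpose(1, 0, 2)
    row_bwd = seqs[2][::-1].reshape(h, w, c)
    col_bwd = seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2)
    return row_fwd + col_fwd + row_bwd + col_bwd


def ss2d_sketch(x, decay=0.9):
    """Scan the feature map along four routes and merge the outputs."""
    x = np.asarray(x, dtype=float)
    h, w, _ = x.shape
    routes = cross_scan(x)
    scanned = np.stack([toy_scan(route, decay) for route in routes])
    return cross_merge(scanned, h, w)
```

In this sketch a single active pixel influences every output position, because the forward routes carry it to later positions and the reverse routes to earlier ones, while the total work remains four linear passes over the H*W positions.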