🤖 AI Summary
Fixed scanning orders in Vision State Space Models (VSSMs) disrupt the spatial locality of images and hinder semantic modeling. To address this limitation, the paper proposes Dynamic Adaptive Scan (DAS), a data-driven mechanism described as the first to jointly optimize scanning paths and regional representations in an end-to-end differentiable manner. Building on DAS, the authors introduce DAMamba, an efficient, general-purpose Vision Mamba backbone that achieves linear computational complexity while preserving a global receptive field. Experiments show that DAMamba consistently outperforms existing Vision Mamba variants across image classification (84.1% top-1 accuracy on ImageNet-1K), object detection, instance segmentation, and semantic segmentation, and that it surpasses several state-of-the-art CNNs and ViTs. The work presents DAS as the first semantic-aware, fully differentiable, and computationally efficient serialization paradigm for visual state space modeling.
📝 Abstract
State space models (SSMs) have recently garnered significant attention in computer vision. However, because of the unique characteristics of image data, SSMs adapted from natural language processing to computer vision have not outperformed state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily rely on manually designed scans that flatten image patches into sequences, either locally or globally. This approach disrupts the original semantic and spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. DAS enables more flexible modeling while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models on vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it also surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.
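The abstract's point about manually designed scans disrupting spatial adjacency can be illustrated with a small sketch. It is not taken from the DAMamba code: the function names (`raster_scan`, `column_scan`, `sequence_gap`) and the 4x4 patch grid are assumptions chosen for illustration. It only shows how a fixed flattening order places spatially adjacent patches far apart in the resulting sequence, and it does not implement DAS itself.

```python
import numpy as np

def raster_scan(h, w):
    """Row-major order: patch indices visited left to right, top to bottom."""
    return np.arange(h * w)

def column_scan(h, w):
    """Column-major order: patch indices visited top to bottom, left to right."""
    return np.arange(h * w).reshape(h, w).T.reshape(-1)

def sequence_gap(order, h, w, a, b):
    """Distance in the flattened sequence between patches a and b, given as (row, col)."""
    position = np.empty(h * w, dtype=int)
    position[order] = np.arange(h * w)  # position[i] = step at which patch i is visited
    ia = a[0] * w + a[1]
    ib = b[0] * w + b[1]
    return int(abs(position[ia] - position[ib]))

# Hypothetical 4x4 grid of image patches.
h, w = 4, 4
row_order = raster_scan(h, w)
col_order = column_scan(h, w)

# Patches (0, 0) and (1, 0) are vertical neighbours in the image.
gap_row = sequence_gap(row_order, h, w, (0, 0), (1, 0))
gap_col = sequence_gap(col_order, h, w, (0, 0), (1, 0))

# Patches (0, 0) and (0, 1) are horizontal neighbours in the image.
gap_row_h = sequence_gap(row_order, h, w, (0, 0), (0, 1))
gap_col_h = sequence_gap(col_order, h, w, (0, 0), (0, 1))

print(gap_row, gap_col, gap_row_h, gap_col_h)
```

Under either fixed order, one of the two neighbour pairs ends up a full row or column apart in the sequence, and that gap grows with the grid size. This is the kind of adjacency loss the abstract attributes to manually designed scans, and which a data-driven scan such as DAS aims to avoid.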