🤖 AI Summary
State-space models (SSMs) like Mamba suffer from limited receptive fields due to their unidirectional scanning mechanism; existing bidirectional extensions require an additional global backward pass, incurring substantial computational overhead. This paper proposes LBMamba, a lightweight locally bidirectional SSM block that embeds a thread-level, register-resident local backward scan within a single forward selective scan, and combines it with an alternating-direction scanning strategy to build the LBVim vision backbone. Crucially, this restores a global receptive field at zero extra scanning cost: the core contribution is fusing local bidirectional modeling with register-level parallel computation, supporting vision tasks from classification to dense prediction. Experiments demonstrate consistent, significant improvements: +0.8–1.6% top-1 accuracy on ImageNet-1K, +0.6–2.7% mIoU on ADE20K, +0.9% AP<sub>b</sub> and +1.1% AP<sub>m</sub> on COCO detection, and up to a 3.06% relative AUC improvement on whole-slide image pathology classification, achieving superior expressiveness without sacrificing efficiency.
📝 Abstract
Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel selective scan, has recently emerged as a linearly scaling, efficient alternative to self-attention. Because of its unidirectional nature, each state in Mamba carries information only about preceding states and is blind to those that follow. Current Mamba-based computer-vision methods typically overcome this limitation by augmenting Mamba's global forward scan with a global backward scan, forming a bidirectional scan that restores a full receptive field. However, this doubles the computational load, eroding much of the efficiency advantage Mamba originally offers. To eliminate these extra scans, we introduce LBMamba, a locally bidirectional SSM block that embeds a lightweight local backward scan inside the forward selective scan and executes it entirely in per-thread registers. Building on LBMamba, we present LBVim, a scalable vision backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate the versatility of our approach on both natural images and whole slide images (WSIs), showing that LBVim consistently offers a superior performance-throughput trade-off: at the same throughput, LBVim achieves 0.8% to 1.6% higher top-1 accuracy on the ImageNet-1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher AP<sub>b</sub> and 1.1% higher AP<sub>m</sub> on the COCO detection dataset. We also integrate LBMamba into the SOTA pathology multiple instance learning (MIL) approach, MambaMIL, which uses a single-directional scan. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy.
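The idea of a local backward scan fused into a forward scan can be illustrated with a toy scalar recurrence. The sketch below is an illustrative simplification, not the paper's CUDA implementation: the scalar recurrence h_t = a_t·h_{t-1} + b_t·x_t, the window size, and the averaging fusion are all assumptions chosen for clarity. The key point it demonstrates is that the backward pass is confined to small non-overlapping windows (standing in for per-thread registers), so no second global sweep over the sequence is needed.

```python
def forward_scan(a, b, x):
    """Global forward recurrence h_t = a[t]*h_{t-1} + b[t]*x[t] (toy scalar SSM)."""
    h, prev = [], 0.0
    for t in range(len(x)):
        prev = a[t] * prev + b[t] * x[t]
        h.append(prev)
    return h

def local_backward_scan(a, b, x, window):
    """Backward recurrence restricted to non-overlapping windows of size `window`,
    mimicking a register-resident local reverse pass: no extra global sweep."""
    h = [0.0] * len(x)
    for start in range(0, len(x), window):
        end = min(start + window, len(x))
        prev = 0.0
        for t in range(end - 1, start - 1, -1):
            prev = a[t] * prev + b[t] * x[t]
            h[t] = prev
    return h

def lb_scan(a, b, x, window=4):
    """Fuse global forward and local backward states (simple averaging here;
    LBMamba's actual fusion may differ)."""
    f = forward_scan(a, b, x)
    g = local_backward_scan(a, b, x, window)
    return [0.5 * (u + v) for u, v in zip(f, g)]
```

Each output position thus sees all earlier tokens through the forward state plus the later tokens within its local window through the backward state; alternating the overall scan direction across layers, as LBVim does, is what extends this local backward context into a global receptive field.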