MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba

📅 2025-11-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Mamba’s inference efficiency is limited by its fixed, high-density token inputs, and existing token pruning or merging methods often discard critical visual information and cannot adapt to image complexity. To address this, the paper proposes CF-ViM (Coarse-to-Fine Vision Mamba), presented as the first confidence-based dynamic multi-scale inference framework for Vision Mamba. Inference begins with coarse-grained, large-patch tokens for efficient global modeling, then adaptively triggers fine-grained recomputation only in low-confidence local regions, leveraging Vision Mamba’s sequential modeling capability for progressive coarse-to-fine refinement. Crucially, CF-ViM avoids explicit token compression, preserving full fine-grained representational fidelity while allocating computation according to image complexity. Evaluated on ImageNet, CF-ViM significantly outperforms the baseline Vision Mamba and state-of-the-art token reduction methods, achieving a superior trade-off between accuracy and inference efficiency.

📝 Abstract
Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss, as they discard or compress token representations. This problem is exacerbated when applied uniformly to fine-grained token representations across all images, regardless of visual complexity. We observe that not all inputs require fine-grained processing. Simple images can be effectively handled at coarse resolution, while only complex ones may warrant refinement. Based on this insight, we propose *Coarse-to-Fine Vision Mamba (CF-ViM)*, an adaptive framework for efficient inference. CF-ViM first performs coarse-grained inference by dividing the input image into large patches, significantly reducing the token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover critical visual details with minimal additional cost. This dynamic resolution assignment strategy allows CF-ViM to allocate computation adaptively according to image complexity, ensuring efficient processing without compromising essential visual information. Experiments on ImageNet demonstrate that CF-ViM outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
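The two-stage mechanism the abstract describes (coarse pass, confidence check, selective fine recomputation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `coarse_model`, `fine_model`, the threshold `tau`, and the median-based region selection are all hypothetical stand-ins for whatever interfaces and gating rule CF-ViM actually uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coarse_to_fine_infer(image, coarse_model, fine_model, tau=0.8):
    """Confidence-gated two-stage inference (illustrative sketch).

    Stage 1: classify from coarse, large-patch tokens.
    Stage 2: if the max softmax probability is below tau, re-process
    the least-confident regions at a finer patch size.
    """
    logits, region_conf = coarse_model(image)   # image logits + per-region confidences
    probs = softmax(logits)
    if probs.max() >= tau:                      # easy image: the coarse pass suffices
        return int(probs.argmax()), "coarse"
    # Refine only the regions below the median per-region confidence.
    refine_mask = region_conf < np.median(region_conf)
    probs = softmax(fine_model(image, refine_mask))
    return int(probs.argmax()), "refined"

# Stand-in models so the sketch runs end to end (not the paper's networks).
rng = np.random.default_rng(0)
def dummy_coarse(img):
    return np.array([0.1, 2.5, 0.2]), rng.random(16)
def dummy_fine(img, mask):
    return np.array([0.1, 0.2, 3.0])

image = np.zeros((224, 224, 3))
# Max coarse confidence here is ~0.84, so tau=0.9 forces refinement.
label, stage = coarse_to_fine_infer(image, dummy_coarse, dummy_fine, tau=0.9)
```

Lowering `tau` to 0.8 in the call above lets the coarse pass terminate early, which is exactly the complexity-aware compute allocation the abstract claims: easy images pay only the short coarse token sequence.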
Problem

Research questions and friction points this paper is trying to address.

How to reduce the token count in Vision Mamba without losing information.
How to adapt processing resolution to the complexity of each image.
How to improve inference efficiency while preserving critical visual details.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive coarse-to-fine inference reduces tokens dynamically
Low-confidence regions are reprocessed at finer resolution selectively
Dynamic resolution assignment allocates computation based on image complexity
Shanhui Liu (The University of Sydney)
Rui Xu (Wuhan University)
Yunke Wang (University of Sydney)
generative model · robotics · imitation learning · reinforcement learning