🤖 AI Summary
Vision Mamba’s inference efficiency is constrained by its fixed, high-density token inputs; existing token pruning or merging methods often discard critical visual information and cannot adapt to image complexity. To address this, we propose CF-ViM, a Confidence-guided Fine-grained Vision Mamba framework that introduces the first confidence-based dynamic multi-scale inference mechanism. It begins with coarse-grained tokens from large patches for efficient global modeling, then adaptively triggers fine-grained recomputation only in low-confidence local regions, leveraging Vision Mamba’s sequential modeling capability for progressive “coarse-to-fine” refinement. Crucially, CF-ViM avoids explicit token compression, preserving full fine-grained representation fidelity while allocating computation according to image complexity. Evaluated on ImageNet, CF-ViM significantly outperforms the baseline Vision Mamba and state-of-the-art token reduction methods, achieving a superior trade-off between accuracy and inference efficiency.
📝 Abstract
Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently incur information loss, as they discard or compress token representations. This problem is exacerbated when such reduction is applied uniformly to fine-grained token representations across all images, regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be handled effectively at coarse resolution, while only complex ones warrant refinement. Based on this insight, we propose *Coarse-to-Fine Vision Mamba (CF-ViM)*, an adaptive framework for efficient inference. CF-ViM first performs coarse-grained inference by dividing the input image into large patches, significantly reducing the token sequence length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover critical visual details with minimal additional cost. This dynamic resolution assignment strategy allows CF-ViM to allocate computation adaptively according to image complexity, ensuring efficient processing without discarding essential visual information. Experiments on ImageNet demonstrate that CF-ViM outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
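The confidence-gated control flow described above (coarse pass first, fine recomputation only when the prediction is uncertain) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `coarse_model` and `fine_model` are hypothetical stand-ins for Vision Mamba run on large-patch and small-patch tokenizations, and the threshold `tau` is an assumed hyperparameter.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def coarse_to_fine_predict(image, coarse_model, fine_model, tau=0.8):
    """Run cheap coarse inference first; refine only if max-probability
    confidence falls below the threshold tau (illustrative logic)."""
    probs = softmax(coarse_model(image))   # few tokens: large patches
    conf = max(probs)
    if conf >= tau:                        # easy image: accept coarse result
        return probs.index(conf), "coarse"
    # Hard image: re-process (selected regions of) the input at finer
    # resolution; here the fine model simply gets the whole image.
    probs = softmax(fine_model(image))
    return probs.index(max(probs)), "fine"
```

For a confident coarse prediction the fine model is never invoked, which is where the computation savings come from; only ambiguous inputs pay the full fine-grained cost.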