Dynamic Vision Mamba

📅 2025-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address token redundancy (causing train-inference inconsistency) and block redundancy (impeding inference speed) in Mamba-based vision models, this paper proposes dynamic token reordering and adaptive SSM block selection. We introduce the first image-level dynamic token pruning and block-level dynamic routing methods tailored for the Mamba architecture, enabling precise computational cost control while preserving train-inference consistency. Our approach jointly optimizes model structure under FLOPs-aware constraints and refines state-space modeling, ensuring cross-model and cross-task generalizability. On Vim-S, it reduces FLOPs by 35.2% with only a 1.7% top-1 accuracy drop. Extensive experiments across diverse Mamba vision models—including classification and object detection tasks—demonstrate consistent effectiveness and robustness.
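The token-pruning idea described above — dropping low-importance tokens but rearranging the survivors into their original sequence order before the next Mamba block, so the recurrent scan sees a consistent ordering at train and inference time — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, the use of a per-token score vector, and the `keep_ratio` parameter are assumptions for the example.

```python
import numpy as np

def prune_and_reorder(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, then restore original order.

    tokens: (n, d) array of token embeddings
    scores: (n,) per-token importance scores (assumed given, e.g. by a
            learned predictor; how they are produced is not shown here)
    keep_ratio: fraction of tokens to retain
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep_idx = np.argsort(scores)[-k:]  # indices of the top-k tokens
    keep_idx = np.sort(keep_idx)        # rearrange back into sequence order
    return tokens[keep_idx], keep_idx
```

The `np.sort` step is the Mamba-specific detail: unlike attention, the SSM scan is order-sensitive, so the pruned sequence must preserve the original token order before entering the next block.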

📝 Abstract
Mamba-based vision models have gained extensive attention because they are computationally more efficient than attention-based models. However, spatial redundancy still exists in these models, in the form of token and block redundancy. For token redundancy, we analytically find that early token pruning methods either cause inconsistency between training and inference or introduce extra computation at inference. Therefore, we customize token pruning to fit the Mamba structure by rearranging the pruned sequence before feeding it into the next Mamba block. For block redundancy, we allow each image to select SSM blocks dynamically, based on the empirical observation that the inference speed of Mamba-based vision models is largely determined by the number of SSM blocks. Our proposed method, Dynamic Vision Mamba (DyVM), effectively reduces FLOPs with minor performance drops. We achieve a 35.2% FLOPs reduction with only a 1.7% accuracy loss on Vim-S. It also generalizes well across different Mamba vision model architectures and different vision tasks. Our code will be made public.
Problem

Research questions and friction points this paper is trying to address.

Addresses spatial redundancy in Mamba-based vision models
Optimizes token pruning for Mamba structure efficiency
Dynamically selects SSM blocks to reduce computational costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Custom token pruning for Mamba structure
Dynamic SSM block selection per image
Reduces FLOPs with minimal accuracy loss
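The per-image block selection above can be illustrated with a small sketch: a gating score decides, for each SSM block, whether that block runs or is skipped for the current image, so the executed depth (and hence FLOPs) varies per input. This is a hedged illustration only; the threshold-based gate and the function name are assumptions, not the paper's actual routing mechanism.

```python
def run_dynamic_blocks(x, blocks, gate_scores, threshold=0.5):
    """Apply only the blocks whose gate score exceeds the threshold.

    blocks: list of callables standing in for SSM blocks
    gate_scores: per-block scores for this image (assumed produced by a
                 learned router; the router itself is not shown)
    Returns the transformed input and the number of blocks executed.
    """
    executed = 0
    for block, score in zip(blocks, gate_scores):
        if score > threshold:
            x = block(x)      # run this SSM block
            executed += 1
        # else: skip the block entirely, saving its FLOPs
    return x, executed
```

Because the number of executed blocks differs per image, average inference cost drops while high-scoring (harder) images can still use the full depth.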