🤖 AI Summary
Vision Mamba lacks efficient, training-free token compression methods, and directly adapting ViT-based techniques incurs substantial performance degradation. To address this, we propose MTR (Mamba Token Reduction), the first training-free, plug-and-play token reduction framework designed specifically for Mamba architectures. Its core is a structure-aware importance scoring mechanism that jointly considers positional sensitivity and local feature responsiveness, implemented via max-pooling and importance-based ranking rather than attention, thereby preserving the integrity of Mamba's sequential modeling. Evaluated on the Vim-B backbone, MTR reduces FLOPs by roughly 40% while incurring only a 1.6% drop in ImageNet Top-1 accuracy. Crucially, it requires no fine-tuning and remains consistently effective across diverse downstream tasks and Mamba variants, significantly improving inference efficiency and deployment flexibility.
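The mechanism described above (score tokens without attention, then prune while keeping the surviving tokens in their original sequence order) can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function name `reduce_tokens`, the use of activation magnitude as the "local feature responsiveness" signal, and the 1-D pooling window are all assumptions for the sake of a runnable example.

```python
import numpy as np

def reduce_tokens(tokens: np.ndarray, keep_ratio: float = 0.6,
                  pool_size: int = 3) -> np.ndarray:
    """Illustrative sketch: score tokens by a local max-pooled feature
    response, keep the top-scoring fraction, and restore the original
    sequence order (order matters for a sequential model like Mamba).

    tokens: (n, d) array of token features for one sequence.
    """
    n, _ = tokens.shape
    # Proxy for local feature responsiveness: per-token activation
    # magnitude, max-pooled over a small window along the sequence.
    magnitude = np.abs(tokens).max(axis=1)                  # shape (n,)
    pad = pool_size // 2
    padded = np.pad(magnitude, (pad, pad), mode="edge")
    scores = np.array([padded[i:i + pool_size].max() for i in range(n)])
    # Rank by importance, select top-k indices, then sort the indices
    # so the kept tokens stay in their original order.
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]
```

Because the selected indices are re-sorted before gathering, the compressed sequence is a strict subsequence of the input, which is the property a training-free method needs to avoid disrupting Mamba's state-space recurrence.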
📝 Abstract
Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. However, token reduction, an effective compression technique for ViTs, has rarely been explored in Vision Mamba, even though improving Vision Mamba's efficiency is essential for enabling broader applications. We find that directly applying existing ViT token reduction techniques to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention for importance measurement and overlook the order of the compressed tokens. In this paper, we investigate a Mamba structure-aware importance score that evaluates token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free **M**amba **T**oken **R**eduction framework. Without training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component into various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing the performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.