🤖 AI Summary
Vision Mamba lacks efficient, training-free token compression methods, and directly adapting ViT-based techniques incurs substantial performance degradation. To address this, we propose MTR (Mamba Token Reduction), the first training-free, plug-and-play token reduction framework designed specifically for Mamba architectures. Its core is a structure-aware importance scoring mechanism that jointly considers positional sensitivity and local feature responsiveness, implemented via max-pooling and importance-based ranking rather than attention, thereby preserving the integrity of Mamba's sequential modeling. Evaluated on the Vim-B backbone, MTR reduces FLOPs by roughly 40% while incurring only a 1.6% drop in ImageNet Top-1 accuracy. Crucially, it requires no fine-tuning and remains consistently effective across diverse downstream tasks and Mamba variants, significantly improving inference efficiency and deployment flexibility.
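The mechanism described above (score tokens without attention, then prune while keeping the surviving tokens in their original sequence order) can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function name `reduce_tokens`, the use of activation magnitude as the "local feature responsiveness" signal, and the 1-D pooling window are all assumptions for the sake of a runnable example.

```python
import numpy as np

def reduce_tokens(tokens: np.ndarray, keep_ratio: float = 0.6,
                  pool_size: int = 3) -> np.ndarray:
    """Illustrative sketch: score tokens by a local max-pooled feature
    response, keep the top-scoring fraction, and restore the original
    sequence order (order matters for a sequential model like Mamba).

    tokens: (n, d) array of token features for one sequence.
    """
    n, _ = tokens.shape
    # Proxy for local feature responsiveness: per-token activation
    # magnitude, max-pooled over a small window along the sequence.
    magnitude = np.abs(tokens).max(axis=1)                  # shape (n,)
    pad = pool_size // 2
    padded = np.pad(magnitude, (pad, pad), mode="edge")
    scores = np.array([padded[i:i + pool_size].max() for i in range(n)])
    # Rank by importance, select top-k indices, then sort the indices
    # so the kept tokens stay in their original order.
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])
    return tokens[keep]
```

Because the selected indices are re-sorted before gathering, the compressed sequence is a strict subsequence of the input, which is the property a training-free method needs to avoid disrupting Mamba's state-space recurrence.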
📝 Abstract
Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. However, token reduction, an effective compression technique for ViTs, has rarely been explored in Vision Mamba, even though improving Vision Mamba's efficiency is essential for enabling broader applications. We find that directly applying existing ViT token reduction techniques to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention for importance measurement and overlook the order of the compressed tokens. In this paper, we investigate a Mamba structure-aware importance score that evaluates token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free **M**amba **T**oken **R**eduction framework. Without training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component into various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing the performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.