Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-ratio token compression in Vision Mamba often causes critical knowledge loss and substantial accuracy degradation. To address this, we propose R-MeeTo, a framework built on a "minute-level token merging + lightweight retraining" paradigm: it compresses the token sequence via similarity-based merging, then briefly retrains the reduced model to rebuild the knowledge lost during compression. On ImageNet-1K, R-MeeTo recovers 35.9% top-1 accuracy on Vim-Ti within just three retraining epochs; retraining Vim-Ti/S/B takes only 5/7/17 minutes, respectively. At a 1.2x (up to 1.5x) inference speedup, Vim-S loses only 1.3% accuracy. R-MeeTo thus eases the trade-off between high compression ratio and high accuracy in Mamba-based vision models, enabling rapid, accurate, and resource-efficient inference without architectural modification.

📝 Abstract
Vision Mamba (e.g., Vim) has been successfully applied to computer vision, and token reduction has yielded promising results in Vision Transformers (ViTs). However, token reduction performs less effectively on Vision Mamba than on ViTs. Pruning informative tokens in Mamba leads to a severe loss of key knowledge and poor performance, making pruning a poor fit for improving Mamba's efficiency. Token merging, which preserves more token information than pruning, has demonstrated strong performance in ViTs. Nevertheless, vanilla merging also degrades as the reduction ratio increases, failing to preserve the key knowledge in Mamba. Re-training the token-reduced model restores Mamba's performance by effectively rebuilding that key knowledge. Empirically, pruned Vims drop only up to 0.9% accuracy on ImageNet-1K once recovered by our proposed framework R-MeeTo in our main evaluation. We show how simply and effectively this fast recovery can be achieved at the minute level: in particular, a 35.9% accuracy gain over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are re-trained within 5/7/17 minutes, and Vim-S drops only 1.3% accuracy with a 1.2x (up to 1.5x) inference speedup.
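To make the merging step concrete, here is a minimal sketch of similarity-based token merging: average the most cosine-similar adjacent token pairs while preserving sequence order. This is an illustration of the general idea only; the function name and the greedy adjacent-pair rule are assumptions for the example, not the paper's actual merging algorithm.

```python
import math

def merge_tokens(tokens, r):
    """Merge the r most cosine-similar adjacent token pairs by averaging.

    Illustrative sketch of similarity-based token merging; the paper's
    actual rule may differ. `tokens` is a list of equal-length float
    vectors; the result has len(tokens) - r vectors.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Similarity of each token to its right-hand neighbor.
    sims = [cos(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

    # Greedily pick the r most similar, non-overlapping pairs.
    used, pairs = set(), set()
    for i in sorted(range(len(sims)), key=lambda i: -sims[i]):
        if i in used or i + 1 in used:
            continue
        pairs.add(i)
        used.update({i, i + 1})
        if len(pairs) == r:
            break

    # Rebuild the sequence in its original order, averaging merged
    # pairs (order matters for Mamba's sequential scan).
    out, j = [], 0
    while j < len(tokens):
        if j in pairs:
            out.append([(a + b) / 2 for a, b in zip(tokens[j], tokens[j + 1])])
            j += 2
        else:
            out.append(tokens[j])
            j += 1
    return out

seq = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(len(merge_tokens(seq, 1)))  # 3 tokens remain after merging one pair
```

In the paper's paradigm, a reduced model like this would then be retrained for a few epochs to rebuild the knowledge the merge removed.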
Problem

Research questions and friction points this paper is trying to address.

Token reduction inefficiency in Vision Mamba
Key knowledge loss in token pruning
Fast recovery via re-training in Vision Mamba
Innovation

Methods, ideas, or system contributions that make the work stand out.

Re-training token-reduced Mamba models
Token merging preserves key knowledge
Minute-level fast recovery achieved