Towards Efficient Vision State Space Models via Token Merging

📅 2025-08-19

📈 Citations: 0

✨ Influential: 0

career value

231K/year

🤖 AI Summary

To address the challenge of balancing token reduction and sequence modeling capability in visual state space models (SSMs), this paper proposes MaMe—the first dynamic token merging method tailored for SSMs. MaMe leverages the intrinsic state transition parameter Δ of SSMs to design a learnable token importance metric and introduces a structured reordering mechanism that explicitly preserves sequential information flow during merging. Crucially, it achieves this without auxiliary modules or external supervision. By jointly optimizing computational efficiency and modeling fidelity, MaMe attains 40–60% average token compression across image classification, video action recognition, and audio classification benchmarks—outperforming existing token pruning and merging approaches. It demonstrates strong generalization across modalities and is deployment-friendly due to its lightweight, architecture-agnostic design.

Technology Category

Application Category

📝 Abstract

State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment.While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities.In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models.MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $mathbfΔ$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow.Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation.Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.

Problem

Research questions and friction points this paper is trying to address.

Improving computational efficiency of vision State Space Models

Addressing token reduction challenges in sequential modeling

Preserving information flow during token merging operations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token merging strategy for SSM vision models

State transition parameter measures token importance

Strategic token arrangements preserve sequential information

🔎 Similar Papers

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference