🤖 AI Summary
To balance token reduction against sequence modeling capability in visual state space models (SSMs), this paper proposes MaMe, the first dynamic token merging method tailored for SSMs. MaMe uses the intrinsic state transition parameter Δ of SSMs to build a learnable token importance metric, and it introduces a structured reordering mechanism that explicitly preserves sequential information flow during merging. It does so without auxiliary modules or external supervision. By jointly optimizing computational efficiency and modeling fidelity, MaMe attains 40–60% average token compression across image classification, video action recognition, and audio classification benchmarks, outperforming existing token pruning and merging approaches. It generalizes well across modalities and is deployment-friendly thanks to its lightweight, architecture-agnostic design.
📝 Abstract
State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $\mathbf{\Delta}$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly, our approach maintains robustness even under aggressive token reduction where existing methods undergo significant performance degradation. Beyond image classification, MaMe shows strong generalization capabilities across video and audio domains, establishing an effective approach for enhancing efficiency in diverse SSM applications.
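The two ideas named in the abstract, scoring tokens by $\mathbf{\Delta}$ and keeping the sequence order intact when merging, can be illustrated with a small sketch. This is not MaMe's actual algorithm, which the abstract does not spell out. It rests on three labeled assumptions: tokens with a lower per-token Δ score are treated as less informative, each such token is absorbed into the nearest kept token that precedes it, and merged tokens are combined by a plain average. The function name `merge_tokens_by_delta` and its parameters are hypothetical.

```python
def merge_tokens_by_delta(tokens, delta, num_merge):
    """Illustrative order-preserving token merging (not the paper's method).

    tokens:    list of N feature vectors (each a list of floats)
    delta:     list of N scalar scores, e.g. a per-token mean of Delta
               (assumption: lower score = less informative)
    num_merge: how many tokens to absorb into their neighbours
    """
    n = len(tokens)
    if n == 0 or n != len(delta):
        raise ValueError("tokens and delta must be non-empty and equally long")
    # Always keep at least one token.
    num_merge = max(0, min(num_merge, n - 1))

    # Rank tokens from least to most informative (sorted() is stable).
    ranked = sorted(range(n), key=lambda i: delta[i])
    absorbed = set(ranked[:num_merge])
    kept = [i for i in range(n) if i not in absorbed]  # original order

    # Assign each absorbed token to the nearest kept token before it,
    # falling back to the first kept token if none precedes it.
    groups = {k: [tokens[k]] for k in kept}
    for i in sorted(absorbed):
        earlier = [k for k in kept if k < i]
        target = earlier[-1] if earlier else kept[0]
        groups[target].append(tokens[i])

    # Average each group; output follows the original sequence order.
    merged = []
    for k in kept:
        members = groups[k]
        dim = len(members[0])
        merged.append(
            [sum(m[d] for m in members) / len(members) for d in range(dim)]
        )
    return merged


tokens = [[1.0], [2.0], [3.0], [4.0]]
delta = [0.9, 0.1, 0.8, 0.2]
print(merge_tokens_by_delta(tokens, delta, num_merge=2))  # [[1.5], [3.5]]
```

Because the kept tokens are emitted in their original positions, a downstream sequential scan still sees them in the same relative order, which is the property the abstract identifies as important for SSMs.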