🤖 AI Summary
This work addresses the quadratic computational complexity of Vision Transformers (ViTs) caused by self-attention, and a limitation of existing training-free token merging methods: they assume uniform token importance and thus risk discarding critical information under aggressive compression. The authors propose AdaMerge, a framework that, to the authors' knowledge, is the first to integrate saliency-aware similarity weighting with layer-wise adaptive merging intensity. It preserves high-saliency tokens and adjusts compression rates based on input redundancy, all without retraining. Token saliency is estimated via column-wise feature affinity centrality and incorporated into the bipartite matching scores, while precomputed layer-wise similarity statistics guide adaptive control of the number of merged tokens per layer. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes, with only a 1.06% Top-1 accuracy drop at 13.4G FLOPs, advancing the accuracy–efficiency Pareto frontier.
📝 Abstract
The quadratic cost of self-attention in Vision Transformers (ViTs) is a fundamental bottleneck for practical deployment, motivating an active line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an implicit assumption that all tokens are equally important, which conflicts with the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity uses column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count according to input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge incurs a Top-1 accuracy drop of only 1.06%, compared to 1.45% for PiToMe and 4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction in a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
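The two mechanisms can be illustrated with a small NumPy sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration rather than the authors' implementation: the salience proxy (column mean of a row-softmaxed cosine-affinity matrix), the penalty form `sim - alpha * salience`, the salience-weighted average used for merging, the ratio rule for the per-layer count, and all names (`column_centrality`, `salience_weighted_merge`, `adaptive_r`, `alpha`, `ref_mean`, `r_base`, `r_max`). The alternating A/B split follows the usual ToMe-style bipartite matching; class-token handling is omitted.

```python
import numpy as np

def column_centrality(x):
    """Assumed salience proxy: column mean of the row-softmaxed cosine affinity."""
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    aff = xn @ xn.T                                   # (N, N) cosine affinities
    a = np.exp(aff - aff.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)                # each row sums to 1
    return a.mean(axis=0)                             # (N,), sums to 1

def salience_weighted_merge(x, r, alpha=1.0):
    """Merge r tokens from set A into set B; x is (N, D) with N >= 2."""
    n = x.shape[0]
    s = column_centrality(x)
    a_idx, b_idx = np.arange(0, n, 2), np.arange(1, n, 2)  # alternating split
    r = min(r, len(a_idx))
    xn = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = xn[a_idx] @ xn[b_idx].T                     # (|A|, |B|)
    # Assumed scoring rule: salient A tokens get a lower score, so they are
    # ranked later and are less likely to be merged away.
    score = sim - alpha * (s[a_idx] / s.max())[:, None]
    best_b = score.argmax(axis=1)                     # best partner per A token
    order = np.argsort(-score.max(axis=1))            # highest-scoring edges first
    src, keep = order[:r], np.sort(order[r:])
    # Salience-weighted average: salient tokens dominate the merged feature.
    weighted = s[:, None] * x
    num, den = weighted[b_idx], s[b_idx].copy()       # fancy indexing copies
    np.add.at(num, best_b[src], weighted[a_idx[src]])
    np.add.at(den, best_b[src], s[a_idx[src]])
    return np.concatenate([x[a_idx[keep]], num / den[:, None]], axis=0)

def adaptive_r(best_sim, ref_mean, r_base, r_max):
    """Assumed rule: scale the base count by input similarity relative to a
    precomputed (positive) per-layer reference mean, then clip to [0, r_max]."""
    ratio = best_sim.mean() / ref_mean
    return int(np.clip(np.rint(r_base * ratio), 0, r_max))
```

Under these assumptions, an input whose tokens are more similar than the layer's reference statistic receives a larger reduction count, and a less redundant input receives a smaller one. For a `(16, 8)` token matrix, `salience_weighted_merge(x, r=4)` returns a `(12, 8)` matrix.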