AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems: the quadratic computational cost of self-attention in Vision Transformers (ViTs), and the assumption of uniform token importance in existing training-free token merging methods, which risks discarding critical information under aggressive compression. The authors propose AdaMerge, a framework that, by their account, is the first to integrate salience-weighted similarity with layer-wise adaptive merging intensity: it preserves high-salience tokens and adjusts the compression rate to input redundancy, all without retraining. Token salience is estimated via column-wise feature-affinity centrality and incorporated into the bipartite matching score, while precomputed layer-wise similarity statistics control the number of tokens merged per layer. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes, with only a 1.06% Top-1 accuracy drop at 13.4G FLOPs (versus 1.45% for PiToMe and 4.62% for DSM), advancing the accuracy-FLOPs Pareto frontier.
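The salience-weighted matching described above can be illustrated with a small sketch. It is not the authors' implementation: this page does not give the exact formulas, so the softmax-normalised cosine affinity, the penalty weight `alpha`, the alternating split into two token sets (borrowed from ToMe-style bipartite matching), and the function names `salience_scores` and `merge_tokens` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def salience_scores(tokens):
    """ASSUMED form of column-wise affinity centrality: row-softmax the
    cosine-affinity matrix, then average each column. Scores sum to 1."""
    n = len(tokens)
    col_sum = [0.0] * n
    for i in range(n):
        row = [math.exp(cosine(tokens[i], tokens[j])) for j in range(n)]
        z = sum(row)
        for j in range(n):
            col_sum[j] += row[j] / z
    return [c / n for c in col_sum]

def merge_tokens(tokens, r, alpha=1.0):
    """Merge up to r tokens; requires at least 2 tokens.
    ASSUMPTIONS: the matching score is cosine similarity minus alpha times
    the source token's salience (so salient tokens are merged last), and
    merged tokens are salience-weighted averages."""
    n = len(tokens)
    sal = salience_scores(tokens)
    set_a = list(range(0, n, 2))  # candidate tokens to merge away
    set_b = list(range(1, n, 2))  # destination tokens
    edges = []
    for i in set_a:
        j = max(set_b, key=lambda k: cosine(tokens[i], tokens[k]))
        edges.append((cosine(tokens[i], tokens[j]) - alpha * sal[i], i, j))
    edges.sort(reverse=True)  # highest salience-adjusted score first
    merged_away = {i: j for _, i, j in edges[:min(r, len(edges))]}

    # Salience-weighted accumulation into each destination token.
    acc = {j: ([sal[j] * x for x in tokens[j]], sal[j]) for j in set_b}
    for i, j in merged_away.items():
        vec, w = acc[j]
        acc[j] = ([a + sal[i] * x for a, x in zip(vec, tokens[i])], w + sal[i])

    out = []
    for idx in range(n):
        if idx in merged_away:
            continue  # this token was absorbed into its match
        if idx in acc:
            vec, w = acc[idx]
            out.append([a / w for a in vec])
        else:
            out.append(list(tokens[idx]))
    return out
```

Calling `merge_tokens(tokens, r=2)` on six tokens returns four. Because the merge is a salience-weighted average, a high-salience token dominates the representation it is merged into, which is one plausible reading of "pivotal tokens contribute more strongly."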

📝 Abstract

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
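The second mechanism, adaptive merging intensity, can be sketched as a rule that scales a base reduction count by how redundant the current input looks relative to pre-computed layer statistics. The abstract does not specify the rule, so the z-score form, the gain `gamma`, the cap `max_r`, and the name `adaptive_reduction` below are hypothetical choices, not the paper's method.

```python
def adaptive_reduction(observed_sim, layer_mean, layer_std, base_r,
                       gamma=0.5, max_r=None):
    """Return the number of tokens to merge at one layer.

    observed_sim: mean best-match similarity of the current input at this layer.
    layer_mean, layer_std: statistics of that quantity, pre-computed offline
        (for example on a calibration set; the source does not say how).
    base_r: the fixed per-layer reduction a non-adaptive schedule would use.
    gamma, max_r: HYPOTHETICAL gain and upper bound, not from the paper.
    """
    # How much more (or less) redundant this input is than typical.
    z = (observed_sim - layer_mean) / max(layer_std, 1e-6)
    r = round(base_r * (1.0 + gamma * z))
    upper = 2 * base_r if max_r is None else max_r
    return max(0, min(upper, r))  # clamp to a valid reduction count
```

Under this rule an input whose tokens are more similar than average at a given layer is merged more aggressively, and a less redundant input is merged less, while an input matching the layer mean receives exactly `base_r`.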

Problem

Research questions and friction points this paper is trying to address.

Vision Transformers

token merging

salience

training-free acceleration

self-attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

salience-aware

adaptive token merging

training-free acceleration