Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformer (ViT) inference incurs substantial computational overhead, and existing token compression methods either require fine-tuning or suffer from representational degradation. Method: We propose Adaptive Token Merging (ATM), a fine-tuning-free, layer-adaptive token merging framework that achieves lossless compression. ATM introduces (i) a layer-wise adaptive similarity threshold enabling dynamic merging across layers and batches, and (ii) a matching mechanism that jointly accounts for token similarity and merging sizes to balance compression ratio and representation fidelity. Results: On DeiT-Tiny and DeiT-Small, ATM reduces FLOPs by over 30% with zero top-1 accuracy loss, outperforming all training-free methods and most fine-tuning-dependent approaches. This enables efficient ViT inference without sacrificing accuracy.

📝 Abstract
Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes, particularly for the final layers, to minimize the information loss incurred from each merging operation. We empirically validate our method across a wide range of pretrained models, demonstrating that ATM not only outperforms all existing training-free methods but also surpasses most training-intensive approaches, even without additional training. Remarkably, training-free ATM achieves over a 30% reduction in FLOPs for the DeiT-T and DeiT-S models without any drop in their original accuracy.
Problem

Research questions and friction points this paper is trying to address.

Reduce Vision Transformers' computational overhead without fine-tuning
Prevent information loss in token compression techniques
Maintain model accuracy while significantly decreasing FLOPs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Token Merging ensures lossless compression
Layer-specific thresholds prevent undesirable token merging
Token matching considers similarity and merging sizes
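To make the idea above concrete, here is a minimal NumPy sketch of similarity-threshold token merging with size-weighted averaging. This is an illustrative approximation, not the paper's actual algorithm: the function name `merge_tokens`, the greedy nearest-neighbor matching, and the fixed `threshold` argument are all assumptions; the paper's ATM adapts the threshold per layer and batch and uses a more careful matching scheme.

```python
import numpy as np

def merge_tokens(tokens, sizes, threshold):
    """Greedy similarity-threshold token merging (illustrative sketch).

    tokens:    (N, D) array of token embeddings.
    sizes:     (N,) count of original tokens each row already represents.
    threshold: cosine-similarity cutoff; only pairs at least this
               similar are merged (stands in for ATM's layer-adaptive
               threshold, which the paper tunes per layer and batch).
    Returns the merged (M, D) tokens and their updated sizes, M <= N.
    """
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a token never merges with itself

    merged, merged_sizes = [], []
    used = np.zeros(len(tokens), dtype=bool)
    for i in range(len(tokens)):
        if used[i]:
            continue
        j = int(np.argmax(sim[i]))  # nearest neighbor by cosine similarity
        if not used[j] and sim[i, j] >= threshold:
            # Size-weighted average, so a merged token still equals the
            # mean of all the original tokens it represents.
            w_i, w_j = sizes[i], sizes[j]
            merged.append((w_i * tokens[i] + w_j * tokens[j]) / (w_i + w_j))
            merged_sizes.append(w_i + w_j)
            used[i] = used[j] = True
        else:
            merged.append(tokens[i])
            merged_sizes.append(sizes[i])
            used[i] = True
    return np.stack(merged), np.array(merged_sizes)
```

With a high threshold, only near-duplicate tokens are merged, which is the sense in which the compression is "lossless": dissimilar tokens are left untouched rather than forced together to hit a fixed reduction ratio.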