Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dynamic token compression methods for Vision Transformers (ViTs) suffer from substantial information loss and performance degradation, and rely on post-training adaptation to recover accuracy. Method: This paper proposes a training-free, unified token compression framework that models pruning, merging, and related operations as explicit many-to-many token matrix transformations, with each existing method corresponding to a special form of transformation matrix. Lightweight transformation matrices enable end-to-end inference acceleration without any additional training. Contribution/Results: The method incurs zero training overhead and remains compatible with diverse downstream tasks, including image classification, semantic segmentation, object detection, depth estimation, and language model generation. On DeiT-S, it reduces FLOPs by 40% and accelerates inference by 1.5×, with only a 0.1% top-1 accuracy drop, consistently improving the computation–accuracy trade-off across tasks.

📝 Abstract
Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in compressing vision transformers dynamically at the token level. Current methods mainly focus on token pruning or merging to reduce token numbers, in which tokens are compressed exclusively, causing great information loss; post-training is therefore inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that generalizes all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce 40% FLOPs and accelerate DeiT-S by 1.5$\times$ with a marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
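The abstract's central idea, that pruning, merging, and many-to-many transforms are all matrix transformations applied to the token sequence, can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed constructions (row-selection for pruning, group-averaging for merging, row-normalized soft weights for many-to-many), not the paper's exact formulation:

```python
import numpy as np

# Tokens X: (N, d). Any token reduction is Y = M @ X with M: (r, N), r < N.
N, d, r = 6, 4, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))

# Pruning: a 0/1 row-selection matrix keeps r of the N tokens (one-to-one).
keep = [0, 2, 5]
M_prune = np.zeros((r, N))
M_prune[np.arange(r), keep] = 1.0
Y_prune = M_prune @ X  # identical to X[keep]; discarded tokens are lost

# Merging: each output token averages a disjoint group of inputs (many-to-one).
groups = [[0, 1], [2, 3], [4, 5]]
M_merge = np.zeros((r, N))
for i, g in enumerate(groups):
    M_merge[i, g] = 1.0 / len(g)
Y_merge = M_merge @ X

# Many-to-many: soft, overlapping weights, rows normalized so every output
# token is a convex combination of all inputs; no token is dropped outright.
W = rng.random((r, N))
M_soft = W / W.sum(axis=1, keepdims=True)
Y_soft = M_soft @ X
```

Pruning and merging appear here as sparse special cases of the dense many-to-many matrix, which is why the latter can retain more information from the discarded tokens.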
Problem

Research questions and friction points this paper is trying to address.

Reduces computational cost in vision transformers
Unifies token compression as matrix transformation
Enables training-free acceleration with minimal accuracy loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified token matrix transformation framework
Training-free token compression method
Many-to-many token transforming generalization