Efficient Online Inference of Vision Transformers by Training-Free Tokenization

📅 2024-11-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) face significant deployment challenges due to high computational overhead: existing compression methods either require end-to-end fine-tuning or introduce substantial runtime latency, making them ill-suited for online inference. To address this, the paper proposes the Visual Word Tokenizer (VWT), a training-free approach that adapts natural-language tokenization principles to vision for token compression. The VWT groups frequently used patches (visual subwords) into visual words using intra-image or inter-image similarity statistics, while infrequent patches remain intact. Experiments show that the VWT reduces power consumption by up to 19% with at most a 20% increase in runtime; comparative approaches of 8-bit quantization and token merging achieve lower or similar energy efficiency but incur a higher runtime cost (up to 2x or more), giving the VWT a superior trade-off among energy efficiency, latency, and accuracy.

📝 Abstract
The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression methods require additional end-to-end fine-tuning or incur a significant drawback to runtime, thus making them ill-suited for online inference. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 2x or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
Problem

Research questions and friction points this paper is trying to address.

Reduce energy costs for vision transformers without performance loss
Enable efficient online real-time inference for vision transformers
Minimize runtime impact while compressing visual token sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free tokenization for vision transformers
Groups frequent patches into visual words
Leverages intra/inter-image statistics for compression
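The grouping idea above can be sketched as a similarity-based merge over patch embeddings: patches whose embeddings are highly similar collapse into a single "visual word" token, while dissimilar patches pass through unchanged. This is a minimal toy illustration, not the paper's implementation; the cosine-similarity threshold, greedy assignment, and mean-pooling are all assumptions made here for clarity.

```python
import numpy as np

def merge_similar_patches(patches, threshold=0.9):
    """Greedily group patch embeddings whose pairwise cosine similarity
    exceeds `threshold` into one averaged 'visual word' token; patches
    with no similar neighbor remain intact. Toy sketch only."""
    normed = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = normed @ normed.T                    # pairwise cosine similarities
    assigned = np.full(len(patches), -1)       # cluster id per patch (-1 = free)
    clusters = []
    for i in range(len(patches)):
        if assigned[i] != -1:
            continue
        # claim all still-unassigned patches similar to patch i (incl. itself)
        members = np.where((sim[i] >= threshold) & (assigned == -1))[0]
        assigned[members] = len(clusters)
        clusters.append(members)
    # each cluster collapses to a single token: the mean of its members
    return np.stack([patches[m].mean(axis=0) for m in clusters])

# 6 patch embeddings: three near-duplicates of one patch, three distinct
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 8))
patches = np.concatenate([base + 1e-3 * rng.normal(size=(3, 8)),
                          rng.normal(size=(3, 8))])
compressed = merge_similar_patches(patches, threshold=0.9)
print(patches.shape, "->", compressed.shape)  # token sequence gets shorter
```

Because the near-duplicate patches merge into one token, the sequence fed to the transformer shrinks, which is the source of the energy savings; the training-free aspect is that no weights are updated to accommodate the shorter sequence.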
Leonidas Gee
Predictive Analytics Lab, University of Sussex, UK
Wing Yan Li
University of Surrey, UK
V. Sharmanska
Predictive Analytics Lab, University of Sussex, UK
Novi Quadrianto
Professor of Machine Learning, University of Sussex (UK), BCAM (Spain), Monash University (Indonesia)
Trustworthy Machine Learning