🤖 AI Summary
Vision Transformers (ViTs) remain costly to deploy: existing compression methods either require additional end-to-end fine-tuning or add substantial runtime overhead, making them ill-suited for online inference. The authors propose the Visual Word Tokenizer (VWT), a training-free method that adapts the idea of subword tokenization from natural language to vision. VWT leverages intra-image or inter-image similarity statistics to group frequently occurring patches (visual subwords) into visual words, while infrequent patches remain intact. Experiments show a reduction in wattage of up to 19% with at most a 20% increase in runtime; comparative approaches of 8-bit quantization and token merging achieve lower or similar energy efficiency but increase runtime far more (up to 2x or beyond), so VWT offers a better trade-off among energy, latency, and accuracy for online inference.
📝 Abstract
The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression methods require additional end-to-end fine-tuning or incur a significant runtime penalty, making them ill-suited for online inference. We introduce the $\textbf{Visual Word Tokenizer}$ (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to $2\times$ or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
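To make the grouping idea concrete, here is a minimal sketch (not the authors' implementation) of how patch embeddings might be merged into visual words: patches whose cosine similarity to an existing group exceeds a threshold are absorbed into that group and represented by the group mean, while dissimilar patches pass through unchanged. The threshold value and the greedy single-pass assignment are illustrative assumptions.

```python
import numpy as np

def merge_visual_words(patches: np.ndarray, thresh: float = 0.9) -> np.ndarray:
    """Sketch of similarity-based token grouping.

    patches: (N, D) array of patch embeddings.
    Returns an (M, D) array with M <= N, where similar patches
    have been merged into a single averaged "visual word" token.
    """
    # Normalize so dot products are cosine similarities.
    normed = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    groups: list[list[int]] = []       # member indices per visual word
    centroids: list[np.ndarray] = []   # normalized centroid per group
    for i, v in enumerate(normed):
        sims = [float(v @ c) for c in centroids]
        if sims and max(sims) >= thresh:
            # Similar enough to an existing group: merge into it.
            g = int(np.argmax(sims))
            groups[g].append(i)
            c = normed[groups[g]].mean(axis=0)
            centroids[g] = c / np.linalg.norm(c)
        else:
            # No similar group: the patch stays as its own token.
            groups.append([i])
            centroids.append(v)
    # Each visual word is the mean of its member patch embeddings.
    return np.stack([patches[g].mean(axis=0) for g in groups])
```

Because merging is a pure lookup-and-average over precomputed similarity statistics, a scheme like this needs no gradient updates, which is what allows the method to remain training-free.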