🤖 AI Summary
Vision Transformers (ViTs) remain costly to deploy: existing compression methods either require additional end-to-end fine-tuning or add substantial runtime overhead, making them ill-suited for online inference. The authors propose the Visual Word Tokenizer (VWT), a training-free method that adapts the idea of subword tokenization from natural language to vision. VWT leverages intra-image or inter-image similarity statistics to group frequently occurring patches (visual subwords) into visual words, while infrequent patches remain intact. Experiments show a reduction in wattage of up to 19% with at most a 20% increase in runtime; comparative approaches of 8-bit quantization and token merging achieve lower or similar energy efficiency but increase runtime far more (up to 2x or beyond), so VWT offers a better trade-off among energy, latency, and accuracy for online inference.
📝 Abstract
The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression methods require additional end-to-end fine-tuning or incur a significant runtime penalty, making them ill-suited for online inference. We introduce the $\textbf{Visual Word Tokenizer}$ (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups patches (visual subwords) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for compression. Experimentally, we demonstrate a reduction in wattage of up to 19% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to $2\times$ or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.
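To make the grouping idea concrete, here is a minimal sketch (not the authors' implementation) of how patch embeddings might be merged into visual words: patches whose cosine similarity to an existing group exceeds a threshold are absorbed into that group and represented by the group mean, while dissimilar patches pass through unchanged. The threshold value and the greedy single-pass assignment are illustrative assumptions.

```python
import numpy as np

def merge_visual_words(patches: np.ndarray, thresh: float = 0.9) -> np.ndarray:
    """Sketch of similarity-based token grouping.

    patches: (N, D) array of patch embeddings.
    Returns an (M, D) array with M <= N, where similar patches
    have been merged into a single averaged "visual word" token.
    """
    # Normalize so dot products are cosine similarities.
    normed = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    groups: list[list[int]] = []       # member indices per visual word
    centroids: list[np.ndarray] = []   # normalized centroid per group
    for i, v in enumerate(normed):
        sims = [float(v @ c) for c in centroids]
        if sims and max(sims) >= thresh:
            # Similar enough to an existing group: merge into it.
            g = int(np.argmax(sims))
            groups[g].append(i)
            c = normed[groups[g]].mean(axis=0)
            centroids[g] = c / np.linalg.norm(c)
        else:
            # No similar group: the patch stays as its own token.
            groups.append([i])
            centroids.append(v)
    # Each visual word is the mean of its member patch embeddings.
    return np.stack([patches[g].mean(axis=0) for g in groups])
```

Because merging is a pure lookup-and-average over precomputed similarity statistics, a scheme like this needs no gradient updates, which is what allows the method to remain training-free.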