🤖 AI Summary
Vision Transformers suffer from O(N²) computational complexity due to self-attention, leading to inefficient inference on high-resolution inputs. Existing [CLS] token–based pruning methods struggle in early layers where semantic representations are underdeveloped, hindering accurate token importance estimation. This work proposes Col-Ln, a training-free metric for token importance that introduces Rényi entropy into early-stage pruning of Vision Transformers for the first time. By eliminating reliance on the [CLS] token, Col-Ln enables precise identification of informative tokens starting from the very first layer. The method consistently outperforms existing pruning strategies across various Vision Transformer architectures and large vision-language models, achieving both higher accuracy and faster inference speeds on multiple benchmarks.
📝 Abstract
Vision Transformers (ViTs) achieve state-of-the-art performance but suffer from the $O(N^2)$ complexity of self-attention, making inference costly for high-resolution inputs. To address this bottleneck, token pruning has emerged as a key technique for accelerating inference. Most existing methods rely on the [CLS] token to estimate patch importance. However, we argue that the [CLS] token can be unreliable in early layers, where semantic representations are still immature. As a result, pruning in early layers often leads to inaccurate importance estimation and unnecessary information loss. In this work, we propose Col-Ln, a training-free token-importance metric derived from Rényi entropy, which identifies informative tokens starting from the first layer of the network and thereby enables more reliable token reduction. Extensive experiments on ViTs and Large Vision-Language Models (LVLMs) demonstrate that our approach consistently outperforms state-of-the-art pruning methods across diverse benchmarks.
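The abstract does not spell out how Col-Ln is computed, but the core idea of entropy-based, [CLS]-free token scoring can be sketched as follows. This is a minimal illustrative proxy, not the paper's exact metric: it assumes each token's importance is measured by the Rényi entropy of a softmax distribution over its own feature dimensions, and that higher-entropy tokens are kept (both assumptions for illustration).

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy of order alpha for a probability vector p."""
    p = p[p > 0]
    if alpha == 1.0:  # limit case: Shannon entropy
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def token_importance(tokens, alpha=2.0):
    """Score each token by the Rényi entropy of a softmax over its
    feature dimensions (illustrative stand-in for Col-Ln)."""
    e = np.exp(tokens - tokens.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return np.array([renyi_entropy(p, alpha) for p in probs])

def prune_tokens(tokens, keep_ratio=0.5, alpha=2.0):
    """Training-free pruning: keep the top-k tokens by importance,
    with no dependence on a [CLS] token."""
    scores = token_importance(tokens, alpha)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # preserve token order
    return tokens[keep], keep

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # 8 patch tokens, 16-dim features
pruned, kept = prune_tokens(x, keep_ratio=0.5)
print(pruned.shape)  # (4, 16)
```

Because the score depends only on the token's own features, it can be applied from the very first layer, unlike [CLS]-attention scores that require the class token to have accumulated meaningful semantics.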