🤖 AI Summary
Vision Transformers (ViTs) suffer from high computational complexity in self-attention due to quadratic scaling with token sequence length. Existing token-level pruning methods prune query and key tokens independently, neglecting inter-token interactions and thus degrading accuracy. To address this, we propose a block-wise symmetric pruning and fusion framework: leveraging weight sharing between the query and key projections, we prune only the upper-triangular portion of the attention matrix, enabling joint optimization; further, we introduce neighborhood-aware importance scoring and similarity-driven token fusion to explicitly model local structural and semantic correlations. The method is trained end-to-end on standard ViTs without auxiliary modules. On DeiT-T and DeiT-S, it achieves top-1 accuracy gains of +1.3% and +2.0% on ImageNet, respectively, while reducing FLOPs by 50% and accelerating inference by 40%, outperforming state-of-the-art pruning approaches. Our core contribution is the first unified paradigm for visual token compression integrating symmetric pruning, neighborhood-aware evaluation, and similarity-based fusion.
📝 Abstract
The Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT's $O(n^2)$ complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel **Block-based Symmetric Pruning and Fusion** for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token together with its neighbors to decide which tokens to retain, taking token interactions into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational cost. Sharing the weights of the Q/K projections makes the attention matrix symmetric, so only its upper-triangular part needs to be pruned, which speeds up computation. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves a 40% speedup with improved accuracy across various ViTs.
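To make the two key mechanisms concrete, here is a minimal NumPy sketch of (a) why sharing the Q/K projection yields a symmetric attention-score matrix that only needs its upper triangle computed, and (b) a simple similarity-driven fusion step that averages each pruned token into its most similar retained token. All function names, the fusion rule, and the shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def symmetric_attention_scores(x, w_qk):
    """With a shared Q/K projection, S = (XW)(XW)^T / sqrt(d) is symmetric,
    so only the upper triangle needs to be computed and then mirrored.
    (Sketch only; the paper's block-wise scheme is more involved.)"""
    q = x @ w_qk                        # queries == keys under weight sharing
    d = q.shape[-1]
    s = np.triu(q @ q.T) / np.sqrt(d)   # compute upper triangle (incl. diagonal)
    return s + np.triu(s, 1).T          # mirror strict upper part to lower

def fuse_tokens(x, keep_idx, drop_idx):
    """Hypothetical similarity-driven fusion: each pruned token is merged
    (running mean) into its most cosine-similar retained token."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    fused = x[keep_idx].copy()
    counts = np.ones(len(keep_idx))
    for j in drop_idx:
        sims = xn[keep_idx] @ xn[j]     # cosine similarity to retained tokens
        i = int(np.argmax(sims))
        fused[i] += x[j]
        counts[i] += 1
    return fused / counts[:, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))         # 6 tokens, embedding dim 8
w = rng.standard_normal((8, 8))
s = symmetric_attention_scores(x, w)
assert np.allclose(s, s.T)              # symmetry holds, halving the work
out = fuse_tokens(x, keep_idx=[0, 2, 4], drop_idx=[1, 3, 5])
assert out.shape == (3, 8)              # sequence compressed from 6 to 3 tokens
```

The symmetry means roughly half of the score matrix (and any pruning decisions on it) can be skipped, which is the source of the speedup the abstract describes; the fusion step keeps information from pruned tokens instead of discarding it outright.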