🤖 AI Summary
Vision-language models (VLMs) suffer from high computational redundancy in visual tokens, and existing pruning methods rely on auxiliary parameters or post-hoc fine-tuning. Method: This paper proposes a training-free, text-guided visual token sparsification framework. Its core innovations are: (1) leveraging cross-modal self-attention weights to dynamically assess visual token importance under textual semantic guidance; (2) introducing a layer-adaptive, rank-based sparsity strategy that tailors pruning intensity to per-layer feature distributions; and (3) incorporating a token recycling mechanism that compresses pruned tokens into more compact representations to mitigate information loss. Results: Evaluated on LLaVA, the method reduces FLOPs by 54% and CUDA latency by 37% while preserving 97% of the original accuracy. It achieves significant inference acceleration without any training or fine-tuning, remaining fully parameter- and optimization-free.
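To make the text-guided scoring concrete, here is a minimal PyTorch sketch of the first idea: rating visual tokens by the cross-modal attention they receive from text tokens and keeping only the top-scoring ones. The function names (`score_visual_tokens`, `prune_visual_tokens`) and the simple mean-over-heads aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def score_visual_tokens(attn: torch.Tensor,
                        text_idx: torch.Tensor,
                        vis_idx: torch.Tensor) -> torch.Tensor:
    """Rate visual tokens by the attention they receive from text tokens.

    attn:     (heads, seq, seq) self-attention weights from one decoder layer.
    text_idx: positions of the (relevant) text tokens used as raters.
    vis_idx:  positions of the visual tokens.
    Returns one importance score per visual token, shape (len(vis_idx),).
    """
    # Cross-modal block: attention flowing from text queries to visual keys.
    cross = attn[:, text_idx][:, :, vis_idx]   # (heads, n_text, n_vis)
    return cross.mean(dim=(0, 1))              # average over heads and raters

def prune_visual_tokens(hidden: torch.Tensor,
                        scores: torch.Tensor,
                        keep_ratio: float):
    """Keep only the top-scoring visual tokens, preserving their order."""
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices.sort().values
    return hidden[keep], keep
```

Because the scores come from attention weights the model already computes, this step adds essentially no extra cost beyond the top-k selection itself.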
📄 Abstract
In vision-language models (VLMs), visual tokens usually carry significant computational overhead despite containing sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. Differently, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens via self-attention matrices and then prune the visual tokens with the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
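For the two remaining components, a rough sketch follows: a rank-based rule that adapts the per-layer keep ratio to how much non-redundant information the attention logits carry, and a recycling step that aggregates pruned token embeddings into a few compact tokens rather than discarding them. This is a simplified illustration under assumed names (`adaptive_keep_ratio`, `recycle_pruned_tokens`, `num_slots`); the paper's exact rank criterion and aggregation differ in detail.

```python
import torch

def adaptive_keep_ratio(attn_logits: torch.Tensor, scale: float = 1.0) -> float:
    """Layer-adaptive sparsity for one layer's (seq, seq) attention logits.

    The numerical rank serves as a proxy for the layer's non-redundant
    information: a rank far below the matrix size suggests the layer
    tolerates more aggressive pruning."""
    rank = torch.linalg.matrix_rank(attn_logits.float()).item()
    return min(1.0, scale * rank / attn_logits.shape[-1])

def recycle_pruned_tokens(pruned: torch.Tensor, num_slots: int = 4) -> torch.Tensor:
    """Compress pruned token embeddings (n, d) into `num_slots` aggregate
    tokens via a few k-means-style refinement steps, so their information
    is recycled instead of lost."""
    if pruned.shape[0] <= num_slots:
        return pruned
    slots = pruned[:num_slots].clone()          # simple initialization
    for _ in range(5):                          # a few refinement iterations
        assign = torch.cdist(pruned, slots).argmin(dim=1)
        for s in range(num_slots):
            members = pruned[assign == s]
            if members.shape[0] > 0:
                slots[s] = members.mean(dim=0)
    return slots
```

In a full pipeline, these pieces would run inside the decoder forward pass at selected layers: the adaptive ratio decides how many visual tokens to keep, and the recycled slots are appended back to the sequence in place of the pruned tokens.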