🤖 AI Summary
Vision-language models (VLMs) suffer from high computational redundancy in visual tokens, and existing pruning methods rely on auxiliary parameters or post-hoc fine-tuning. Method: This paper proposes a training-free, text-guided visual token sparsification framework. Its core innovations are: (1) leveraging cross-modal self-attention weights to dynamically assess visual token importance under textual semantic guidance; (2) introducing a layer-adaptive, rank-based sparsity strategy that tailors pruning intensity to per-layer feature distributions; and (3) incorporating a token recycling mechanism that compresses pruned tokens into more compact representations to mitigate information loss. Results: Evaluated on LLaVA, the method reduces FLOPs by 54% and CUDA latency by 37% while preserving 97% of the original accuracy. It achieves significant inference acceleration without any training or fine-tuning, remaining fully parameter- and optimization-free.
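To make the text-guided scoring concrete, here is a minimal PyTorch sketch of the first idea: rating visual tokens by the cross-modal attention they receive from text tokens and keeping only the top-scoring ones. The function names (`score_visual_tokens`, `prune_visual_tokens`) and the simple mean-over-heads aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def score_visual_tokens(attn: torch.Tensor,
                        text_idx: torch.Tensor,
                        vis_idx: torch.Tensor) -> torch.Tensor:
    """Rate visual tokens by the attention they receive from text tokens.

    attn:     (heads, seq, seq) self-attention weights from one decoder layer.
    text_idx: positions of the (relevant) text tokens used as raters.
    vis_idx:  positions of the visual tokens.
    Returns one importance score per visual token, shape (len(vis_idx),).
    """
    # Cross-modal block: attention flowing from text queries to visual keys.
    cross = attn[:, text_idx][:, :, vis_idx]   # (heads, n_text, n_vis)
    return cross.mean(dim=(0, 1))              # average over heads and raters

def prune_visual_tokens(hidden: torch.Tensor,
                        scores: torch.Tensor,
                        keep_ratio: float):
    """Keep only the top-scoring visual tokens, preserving their order."""
    k = max(1, int(keep_ratio * scores.numel()))
    keep = scores.topk(k).indices.sort().values
    return hidden[keep], keep
```

Because the scores come from attention weights the model already computes, this step adds essentially no extra cost beyond the top-k selection itself.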
📄 Abstract
In vision-language models (VLMs), visual tokens usually carry significant computational overhead despite containing sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. Differently, we propose a text-guided, training-free token optimization mechanism dubbed SparseVLM that eliminates the need for extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens via self-attention matrices and then prune the visual tokens with the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
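For the two remaining components, a rough sketch follows: a rank-based rule that adapts the per-layer keep ratio to how much non-redundant information the attention logits carry, and a recycling step that aggregates pruned token embeddings into a few compact tokens rather than discarding them. This is a simplified illustration under assumed names (`adaptive_keep_ratio`, `recycle_pruned_tokens`, `num_slots`); the paper's exact rank criterion and aggregation differ in detail.

```python
import torch

def adaptive_keep_ratio(attn_logits: torch.Tensor, scale: float = 1.0) -> float:
    """Layer-adaptive sparsity for one layer's (seq, seq) attention logits.

    The numerical rank serves as a proxy for the layer's non-redundant
    information: a rank far below the matrix size suggests the layer
    tolerates more aggressive pruning."""
    rank = torch.linalg.matrix_rank(attn_logits.float()).item()
    return min(1.0, scale * rank / attn_logits.shape[-1])

def recycle_pruned_tokens(pruned: torch.Tensor, num_slots: int = 4) -> torch.Tensor:
    """Compress pruned token embeddings (n, d) into `num_slots` aggregate
    tokens via a few k-means-style refinement steps, so their information
    is recycled instead of lost."""
    if pruned.shape[0] <= num_slots:
        return pruned
    slots = pruned[:num_slots].clone()          # simple initialization
    for _ in range(5):                          # a few refinement iterations
        assign = torch.cdist(pruned, slots).argmin(dim=1)
        for s in range(num_slots):
            members = pruned[assign == s]
            if members.shape[0] > 0:
                slots[s] = members.mean(dim=0)
    return slots
```

In a full pipeline, these pieces would run inside the decoder forward pass at selected layers: the adaptive ratio decides how many visual tokens to keep, and the recycled slots are appended back to the sequence in place of the pruned tokens.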