TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high GPU memory consumption and low inference efficiency that redundant visual tokens cause in vision-language models (VLMs), this paper proposes a fine-tuning-free visual token pruning method fully compatible with FlashAttention and KV caching. Unlike existing greedy, attention-score-based heuristics, the authors formulate pruning as a vision-perception optimization problem—introducing a multidimensional cost function that jointly incorporates feature similarity, relative spatial distance, and absolute distance to the image center—and apply static pruning once during the prefill phase. Evaluated across multiple VLM benchmarks, the method significantly outperforms prior token pruning approaches: it achieves up to 2.1× faster inference and reduces GPU memory usage by up to 47%, with zero training overhead and near-original model accuracy.

📝 Abstract
Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the large number of visual input tokens used to represent visual information. Previous studies have noted that visual tokens tend to receive less attention than text tokens, suggesting their lower importance during inference and potential for pruning. However, their methods encounter several challenges: reliance on greedy heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores, we formulate token pruning as an optimization problem, accurately identifying important visual tokens while remaining compatible with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates a visual-aware cost function considering factors such as Feature Similarity, Relative Spatial Distance, and Absolute Central Distance, to measure the importance of each source visual token, enabling effective pruning of low-importance tokens. Extensive experiments demonstrate that our method outperforms previous token pruning methods, validating the effectiveness and efficiency of our approach.
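The abstract names three factors in the visual-aware cost function (Feature Similarity, Relative Spatial Distance, Absolute Central Distance) but not their exact combination. The sketch below is a minimal illustration of score-then-prune over visual tokens using those three factors; the scoring formula, the weights `alpha`/`beta`/`gamma`, and the helper names are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def visual_token_scores(feats, positions, center, alpha=1.0, beta=0.5, gamma=0.5):
    """Hypothetical per-token importance score combining the three factors
    named in the abstract. The combination and weights are assumptions."""
    # Feature similarity: cosine similarity of each token to the mean visual feature.
    mean = feats.mean(axis=0)
    sim = feats @ mean / (np.linalg.norm(feats, axis=1) * np.linalg.norm(mean) + 1e-8)
    # Relative spatial distance: mean distance from each token to all other tokens.
    diffs = positions[:, None, :] - positions[None, :, :]
    rel = np.linalg.norm(diffs, axis=-1).mean(axis=1)
    # Absolute central distance: distance from each token to the image center.
    cen = np.linalg.norm(positions - center, axis=1)
    return alpha * sim - beta * rel - gamma * cen

def prune_tokens(feats, positions, center, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by score; done once at prefill,
    so attention afterwards runs only over the kept tokens."""
    scores = visual_token_scores(feats, positions, center)
    k = max(1, int(len(feats) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the k highest-scoring tokens
    return np.sort(keep)            # preserve original token order
```

Because pruning is a one-time index selection rather than a per-layer attention-score heuristic, it does not require materializing attention maps and so stays compatible with fused kernels such as FlashAttention.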
Problem

Research questions and friction points this paper is trying to address.

High computational cost of Vision-Language Model inference
Token pruning methods that require additional training or fine-tuning
Incompatibility of existing pruning criteria with FlashAttention and KV cache
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimization-based token pruning without retraining
Compatible with FlashAttention and KV cache
Visual-aware cost function for token importance