VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low inference efficiency and poor real-time deployability of large vision-language models (LVLMs) caused by long visual token sequences, this paper proposes VScan, a two-stage visual token compression framework. During visual encoding, it combines complementary global and local scans with similarity-based clustering to merge redundant visual tokens; during language decoding, it introduces token pruning at intermediate layers of the language model. The work systematically characterizes how visual token redundancy is distributed across both stages, moving beyond conventional single-stage pruning toward a joint encoder-decoder compression scheme. Applied to LLaVA-NeXT-7B, VScan achieves a 2.91× prefill speedup and a 10× FLOPs reduction while retaining 95.4% of the original performance, outperforming state-of-the-art methods across sixteen benchmarks and generalizing to four mainstream LVLMs.
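The encoder-stage merging can be pictured with a small sketch. The snippet below is a minimal, hypothetical illustration of similarity-based token merging (in the spirit of ToMe-style bipartite matching), not VScan's exact procedure: the r visual tokens most similar to some other token are folded into their nearest neighbors and averaged.

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most redundant visual tokens into their nearest neighbors.

    tokens: [N, D] visual token embeddings from the encoder.
    r: number of tokens to remove (must be <= N // 2 here).
    Returns [N - r, D] merged tokens (original ordering not preserved).
    """
    # Alternating split so that merge targets are always kept tokens.
    a, b = tokens[0::2], tokens[1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # [Na, Nb]

    # Each a-token's most similar b-token; highly similar pairs are redundant.
    best_sim, best_idx = sim.max(dim=-1)
    src = best_sim.topk(r).indices  # the r most redundant a-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool, device=tokens.device)
    keep[src] = False

    # Fold each redundant token into its target, then average each group.
    merged = b.clone()
    counts = torch.ones_like(merged[:, :1])
    for s in src.tolist():
        t = best_idx[s]
        merged[t] += a[s]
        counts[t] += 1
    merged = merged / counts
    return torch.cat([a[keep], merged], dim=0)
```

In VScan itself, the merge decisions are additionally informed by the complementary global and local scan orders; this sketch captures only the similarity-clustering idea.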

📝 Abstract
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in Large Vision-Language Models
Optimizing visual token processing for efficiency
Balancing performance and speed in multi-modal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage visual token reduction framework
Global and local scans with token merging
Pruning at intermediate language model layers
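To make the last point concrete, here is a hedged sketch of what pruning at an intermediate decoder layer can look like: visual tokens that receive little attention from the text tokens are dropped before the remaining layers run. The function below is an illustrative assumption about shapes and hook points (e.g., a HuggingFace-style LLaVA stack), not VScan's actual interface; `vis_start`, `vis_end`, and `keep` are hypothetical parameters.

```python
import torch

def prune_visual_tokens(hidden: torch.Tensor,
                        attn: torch.Tensor,
                        vis_start: int, vis_end: int,
                        keep: int) -> torch.Tensor:
    """Drop low-attention visual tokens from a decoder layer's input.

    hidden: [seq, D] hidden states entering the layer.
    attn:   [heads, seq, seq] attention weights from the previous layer.
    vis_start/vis_end: span of visual tokens within the sequence.
    keep: number of visual tokens to retain.
    """
    # Mean attention each visual token receives from the text tokens
    # that follow it, averaged over heads and text queries.
    text_to_vis = attn[:, vis_end:, vis_start:vis_end]  # [H, T, V]
    score = text_to_vis.mean(dim=(0, 1))                # [V]

    # Keep the top-scoring visual tokens, preserving their original order.
    top = score.topk(keep).indices.sort().values
    kept_visual = hidden[vis_start:vis_end][top]
    return torch.cat([hidden[:vis_start], kept_visual, hidden[vis_end:]], dim=0)
```

Pruning mid-decoder rather than at the encoder output lets the text query inform which visual tokens matter, which is why the score here is computed from text-to-visual attention.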
👥 Authors
Ce Zhang (Carnegie Mellon University)
Kaixin Ma (Researcher, Apple): LLMs, Multimodal Foundation Models, Agents
Tianqing Fang (Tencent AI Lab): Natural Language Processing, Agent, Language Models
Wenhao Yu (Tencent AI Lab)
Hongming Zhang (Tencent AI Lab)
Zhisong Zhang (City University of Hong Kong): Natural Language Processing
Yaqi Xie (Carnegie Mellon University)
Katia P. Sycara (Carnegie Mellon University)
Haitao Mi (Principal Researcher, Tencent US): Large Language Models
Dong Yu (Tencent AI Lab)