VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low inference efficiency and poor real-time deployability of large vision-language models (LVLMs) caused by long visual token sequences, this paper proposes VScan, a two-stage visual token compression framework. During visual encoding, it combines complementary global and local scans with similarity-based clustering to merge redundant visual tokens; during language decoding, it introduces token pruning at intermediate layers of the language model. The work systematically characterizes how visual token redundancy is distributed across both stages, moving beyond conventional single-stage pruning toward a joint encoder-decoder compression scheme. Applied to LLaVA-NeXT-7B, VScan achieves a 2.91× prefill speedup and a 10× FLOPs reduction while retaining 95.4% of the original performance, outperforming state-of-the-art methods across sixteen benchmarks and generalizing to four mainstream LVLMs.
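The encoder-stage merging can be pictured with a small sketch. The snippet below is a minimal, hypothetical illustration of similarity-based token merging (in the spirit of ToMe-style bipartite matching), not VScan's exact procedure: the r visual tokens most similar to some other token are folded into their nearest neighbors and averaged.

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most redundant visual tokens into their nearest neighbors.

    tokens: [N, D] visual token embeddings from the encoder.
    r: number of tokens to remove (must be <= N // 2 here).
    Returns [N - r, D] merged tokens (original ordering not preserved).
    """
    # Alternating split so that merge targets are always kept tokens.
    a, b = tokens[0::2], tokens[1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # [Na, Nb]

    # Each a-token's most similar b-token; highly similar pairs are redundant.
    best_sim, best_idx = sim.max(dim=-1)
    src = best_sim.topk(r).indices  # the r most redundant a-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool, device=tokens.device)
    keep[src] = False

    # Fold each redundant token into its target, then average each group.
    merged = b.clone()
    counts = torch.ones_like(merged[:, :1])
    for s in src.tolist():
        t = best_idx[s]
        merged[t] += a[s]
        counts[t] += 1
    merged = merged / counts
    return torch.cat([a[keep], merged], dim=0)
```

In VScan itself, the merge decisions are additionally informed by the complementary global and local scan orders; this sketch captures only the similarity-clustering idea.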

📝 Abstract
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-art methods on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91× speedup in prefilling and a 10× reduction in FLOPs, while retaining 95.4% of the original performance.
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in Large Vision-Language Models
Optimizing visual token processing for efficiency
Balancing performance and speed in multi-modal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage visual token reduction framework
Global and local scans with token merging
Pruning at intermediate language model layers
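To make the last point concrete, here is a hedged sketch of what pruning at an intermediate decoder layer can look like: visual tokens that receive little attention from the text tokens are dropped before the remaining layers run. The function below is an illustrative assumption about shapes and hook points (e.g., a HuggingFace-style LLaVA stack), not VScan's actual interface; `vis_start`, `vis_end`, and `keep` are hypothetical parameters.

```python
import torch

def prune_visual_tokens(hidden: torch.Tensor,
                        attn: torch.Tensor,
                        vis_start: int, vis_end: int,
                        keep: int) -> torch.Tensor:
    """Drop low-attention visual tokens from a decoder layer's input.

    hidden: [seq, D] hidden states entering the layer.
    attn:   [heads, seq, seq] attention weights from the previous layer.
    vis_start/vis_end: span of visual tokens within the sequence.
    keep: number of visual tokens to retain.
    """
    # Mean attention each visual token receives from the text tokens
    # that follow it, averaged over heads and text queries.
    text_to_vis = attn[:, vis_end:, vis_start:vis_end]  # [H, T, V]
    score = text_to_vis.mean(dim=(0, 1))                # [V]

    # Keep the top-scoring visual tokens, preserving their original order.
    top = score.topk(keep).indices.sort().values
    kept_visual = hidden[vis_start:vis_end][top]
    return torch.cat([hidden[:vis_start], kept_visual, hidden[vis_end:]], dim=0)
```

Pruning mid-decoder rather than at the encoder output lets the text query inform which visual tokens matter, which is why the score here is computed from text-to-visual attention.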
👥 Authors
Ce Zhang (Carnegie Mellon University)
Kaixin Ma (Researcher, Apple): LLMs, Multimodal Foundation Models, Agents
Tianqing Fang (Tencent AI Lab): Natural Language Processing, Agent, Language Models
Wenhao Yu (Tencent AI Lab)
Hongming Zhang (Tencent AI Lab)
Zhisong Zhang (City University of Hong Kong): Natural Language Processing
Yaqi Xie (Carnegie Mellon University)
Katia P. Sycara (Carnegie Mellon University)
Haitao Mi (Principal Researcher, Tencent US): Large Language Models
Dong Yu (Tencent AI Lab)