ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (LVLMs) suffer significant performance degradation under high pruning ratios because existing methods struggle to simultaneously preserve token importance and diversity. This work proposes ID-Selection, a strategy that unifies importance scoring and diversity constraints within a single iterative selection framework. By dynamically suppressing the scores of visually similar tokens, ID-Selection enables extreme pruning without requiring additional training. Extensive experiments across five LVLM backbones and sixteen benchmarks demonstrate that the method retains only 16 visual tokens — a 97.2% pruning ratio — and reduces FLOPs by over 97%, while preserving 91.8% of the original model's performance, substantially outperforming current state-of-the-art pruning approaches.
📝 Abstract
Recent advances have explored visual token pruning to accelerate the inference of large vision-language models (LVLMs). However, existing methods often struggle to balance token importance and diversity: importance-based methods tend to retain redundant tokens, whereas diversity-based methods may overlook informative ones. This trade-off becomes especially problematic under high reduction ratios, where preserving only a small subset of visual tokens is critical. To address this issue, we propose ID-Selection, a simple yet effective token selection strategy for efficient LVLM inference. The key idea is to couple importance estimation with diversity-aware iterative selection: each token is first assigned an importance score, after which high-scoring tokens are selected one by one while the scores of similar tokens are progressively suppressed. In this way, ID-Selection preserves informative tokens while reducing redundancy in a unified selection process. Extensive experiments across 5 LVLM backbones and 16 main benchmarks demonstrate that ID-Selection consistently achieves superior performance and efficiency, especially under extreme pruning ratios. For example, on LLaVA-1.5-7B, ID-Selection prunes 97.2% of visual tokens, retaining only 16 tokens, while reducing inference FLOPs by over 97% and preserving 91.8% of the original performance, all without additional training.
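The selection procedure described in the abstract — score every token for importance, then pick tokens greedily while suppressing the scores of tokens similar to each pick — can be sketched as follows. This is an illustrative reconstruction, not the authors' released implementation: the cosine-similarity measure, the multiplicative suppression rule, and the `alpha` strength parameter are all assumptions made for the example.

```python
import numpy as np

def id_selection(tokens, importance, k, alpha=1.0):
    """Greedy importance-diversity token selection (illustrative sketch).

    tokens:     (N, D) array of visual token features
    importance: (N,) initial importance scores (e.g. attention weights)
    k:          number of tokens to keep
    alpha:      suppression strength (hypothetical parameter, 0..1)
    """
    # Pairwise cosine similarity between all tokens
    norm = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T  # (N, N)

    scores = importance.astype(float).copy()
    selected = []
    for _ in range(k):
        i = int(np.argmax(scores))
        selected.append(i)
        # Suppress scores of tokens similar to the one just selected,
        # so near-duplicates become less likely to be picked next.
        scores = scores * (1.0 - alpha * np.clip(sim[i], 0.0, 1.0))
        # Already-selected tokens can never be picked again.
        scores[selected] = -np.inf
    return selected
```

With `alpha=1.0`, a token nearly identical to an already-selected one has its score driven close to zero, so a less important but more distinctive token wins the next round — the importance-diversity trade-off the paper targets.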
Problem

Research questions and friction points this paper is trying to address.

visual token pruning
importance-diversity trade-off
large vision-language models
efficient inference
token redundancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token pruning
importance-diversity trade-off
efficient LVLM inference
iterative token selection
redundancy reduction
Zhaohong Huang
Xiamen University, 361005, P.R. China
Wenjing Liu
Xiamen University, 361005, P.R. China
Yuxin Zhang
Xiamen University
Fei Chao
Xiamen University, 361005, P.R. China
Rongrong Ji
Xiamen University, 361005, P.R. China