Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the computational inefficiency of vision-language models caused by excessive visual tokens, a challenge that existing compression methods fail to resolve because they struggle to balance token importance against semantic diversity. To overcome this, the authors propose PruneSID, a training-free two-stage compression framework. It first clusters tokens via Principal Semantic Component Analysis (PSCA) to preserve diverse semantics, then applies intra-cluster Non-Maximum Suppression (NMS) to eliminate redundancy, guided by an image complexity-aware dynamic compression ratio. The authors claim PruneSID is the first method to jointly optimize token importance and diversity. Evaluated on LLaVA-1.5, it retains only 11.1% of tokens while achieving 96.3% accuracy; under extreme compression (5.6% of tokens), it maintains 92.8% accuracy on LLaVA-NeXT, yielding a 7.8× speedup in prefilling and outperforming prior approaches, with demonstrated generalizability across models and multimodal tasks.
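The two-stage pipeline described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the use of SVD directions for PSCA-style grouping, the cosine-similarity suppression threshold, and the attention-derived importance scores are all assumptions.

```python
# Hedged sketch of a PruneSID-style two-stage pruning pipeline.
# Assumptions (not from the paper's code): grouping assigns each token to
# the principal direction it projects onto most strongly, and NMS uses a
# fixed cosine-similarity threshold.
import numpy as np

def prune_tokens(tokens, importance, n_groups=4, sim_thresh=0.8):
    """tokens: (N, D) visual token features; importance: (N,) scores."""
    # Stage 1 (PSCA-like): cluster tokens by their dominant principal
    # semantic component so every concept group stays represented.
    centered = tokens - tokens.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = vt[:n_groups]                                # (n_groups, D)
    labels = np.argmax(np.abs(centered @ comps.T), axis=1)

    # Stage 2 (intra-group NMS): greedily keep the most important token
    # in each group, suppressing near-duplicates above the threshold.
    keep = []
    norm = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    for g in range(n_groups):
        idx = np.where(labels == g)[0]
        idx = idx[np.argsort(-importance[idx])]          # most important first
        while idx.size:
            best = idx[0]
            keep.append(int(best))
            sims = norm[idx[1:]] @ norm[best]            # cosine similarity
            idx = idx[1:][sims < sim_thresh]             # drop redundant tokens
    return sorted(keep)

rng = np.random.default_rng(0)
toks = rng.standard_normal((64, 16))   # stand-in for ViT token features
imp = rng.random(64)                   # stand-in for attention importance
kept = prune_tokens(toks, imp)
print(f"kept {len(kept)} of {len(toks)} tokens")
```

In a real VLM the importance scores would come from, e.g., text-to-image attention in the decoder, and the retained count would be set by the dynamic compression ratio rather than emerging from the threshold alone.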

📝 Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
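The information-aware dynamic compression ratio mentioned in the abstract can be illustrated with a small sketch. The complexity measure here is an assumption (the abstract does not specify it): we use the normalized spectral entropy of the token features as a proxy for image complexity, and linearly interpolate between the paper's two reported retention ratios (5.6% and 11.1%).

```python
# Hedged sketch of an image complexity-aware dynamic keep ratio.
# The spectral-entropy complexity proxy and the linear mapping are
# illustrative assumptions, not the paper's mechanism.
import numpy as np

def dynamic_keep_ratio(tokens, r_min=0.056, r_max=0.111):
    """Map image complexity to a token retention ratio in [r_min, r_max]."""
    centered = tokens - tokens.mean(axis=0)
    var = np.linalg.svd(centered, compute_uv=False) ** 2   # spectral energy
    p = var / var.sum()
    # normalized entropy in [0, 1]: 0 = one dominant direction (simple),
    # 1 = energy spread evenly across directions (complex)
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
    return r_min + (r_max - r_min) * entropy

rng = np.random.default_rng(1)
r = dynamic_keep_ratio(rng.standard_normal((64, 16)))
print(f"keep ratio: {r:.3f}")
```

Simple images (low feature entropy) would thus be compressed more aggressively, while complex scenes retain more tokens, matching the abstract's goal of better average information preservation across diverse scenes.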
Problem

Research questions and friction points this paper is trying to address.

vision-language models
visual token compression
redundancy
importance preservation
information diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Token Compression
Synergistic Importance-Diversity
Training-free Pruning
Dynamic Compression Ratio
Non-Maximum Suppression