🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from high computational overhead and limited scalability because image patch tokens vastly outnumber text tokens. Existing text-guided visual token pruning methods rely on the implicit assumption that textual tokens accurately reflect visual importance, yet we reveal systematic causal, semantic, and spatial misalignments between the modalities that severely undermine their efficacy. To address this, we propose VisionDrop, a training-free, text-agnostic visual token pruning framework that selects informative tokens using purely visual self-attention rather than cross-modal guidance. VisionDrop employs a hierarchical pruning strategy: it first identifies dominant tokens, then progressively suppresses redundancy via lightweight contextual merging. Extensive experiments across multiple benchmarks demonstrate significant improvements over state-of-the-art methods; notably, VisionDrop maintains strong robustness even under extremely low token budgets, achieving both efficient inference and superior task performance.
📝 Abstract
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either before or within the large language model (LLM). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing methods, despite requiring no additional training or complex modifications. Its simple yet effective design enables efficient inference while preserving strong performance across tasks.
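The abstract describes the pruning mechanism only at a high level. The sketch below is a rough, hypothetical illustration of what visual-only dominant-token selection followed by lightweight contextual merging could look like for a single image's patch tokens; the function name, interface, and the specific scoring and merging rules (column-mean self-attention as the importance score, nearest-kept-token averaging as the merge) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of visual-only token pruning (not the authors' code):
# dominant-token selection from intra-modal self-attention, followed by
# lightweight contextual merging of the pruned tokens into the kept ones.
import torch
import torch.nn.functional as F


def prune_visual_tokens(tokens: torch.Tensor,
                        attn: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """tokens: (N, D) patch embeddings from the visual encoder.
    attn:   (N, N) visual self-attention, row i = attention paid by token i.
    Returns (K, D) retained tokens, with K roughly N * keep_ratio."""
    n, _ = tokens.shape
    k = max(1, int(round(n * keep_ratio)))

    # 1) Dominant-token selection: score each token by the attention it
    #    receives from the other visual tokens (column-wise mean), a purely
    #    intra-modal signal that never looks at the text prompt.
    importance = attn.mean(dim=0)                 # (N,)
    keep_idx = importance.topk(k).indices         # indices of dominant tokens
    drop_mask = torch.ones(n, dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False

    kept, dropped = tokens[keep_idx], tokens[drop_mask]

    # 2) Lightweight contextual merging: rather than discarding pruned tokens,
    #    fold each one into its most similar kept token (cosine similarity),
    #    so residual context survives even aggressive token budgets.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (N-K, K)
    assign = sim.argmax(dim=-1)                   # nearest kept token per pruned token

    merged = kept.clone()
    counts = torch.ones(k, device=tokens.device)
    merged.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones(len(dropped), device=tokens.device))
    return merged / counts.unsqueeze(-1)
```

In the paper's framing, this kind of step would presumably be applied progressively at several stages of the visual encoder and the LLM rather than once, with the keep ratio governed by the overall token budget.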