ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of vision-language models caused by redundant visual tokens. Existing pruning methods, which rely on single-modality signals, struggle to balance efficiency and accuracy. To overcome this limitation, we propose a training-free, efficient pruning framework that uniquely integrates visual saliency with cross-attention signals from large language models (LLMs). Our approach employs a consensus-ranking mechanism to identify critical tokens and combines it with an encoder-guided token merging strategy to compress redundancy. This design effectively mitigates temporal misalignment and asymmetry between cross-modal signals, maximizing their complementary strengths. Evaluated on models such as LLaVA-1.5/NeXT and Video-LLaVA, our method significantly outperforms existing approaches—substantially reducing token count while preserving near-original accuracy and markedly lowering first-token latency and KV cache memory usage.

📝 Abstract
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.
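The page gives no pseudocode for the consensus ranking, so the sketch below is an assumption-laden illustration only: it fuses the two saliency signals with reciprocal-rank fusion (one common consensus-ranking scheme, not necessarily the paper's), keeps a fixed token budget, and stands in for encoder-guided merging with simple round-robin grouping of the pruned tokens. The function name, parameters, and the `k` smoothing constant are all hypothetical.

```python
import numpy as np

def consensus_keep_and_merge(vis_sal, cross_attn, keep=64, merged=8, k=60.0):
    """Hypothetical sketch of consensus-based token pruning.

    vis_sal:    (N,) vision-encoder saliency per visual token (assumed shape)
    cross_attn: (N,) LLM cross-attention mass per visual token (assumed shape)
    Returns indices of kept tokens and groups of pruned-token indices to merge.
    """
    # Rank each signal independently (rank 0 = most salient); ranking makes
    # the fusion robust to the two signals' very different scales.
    r_vis = np.argsort(np.argsort(-np.asarray(vis_sal)))
    r_att = np.argsort(np.argsort(-np.asarray(cross_attn)))
    # Reciprocal-rank fusion: tokens ranked highly by BOTH signals win.
    score = 1.0 / (k + r_vis) + 1.0 / (k + r_att)
    order = np.argsort(-score)
    kept = np.sort(order[:keep])
    rest = order[keep:]
    # Stand-in for encoder-guided merging: split the pruned tokens into a few
    # groups whose features would be averaged into `merged` summary tokens.
    groups = [g for g in np.array_split(rest, merged)] if len(rest) else []
    return kept, [g for g in groups if len(g)]

# Toy usage with random saliency over 576 tokens (LLaVA-1.5's visual count).
kept, groups = consensus_keep_and_merge(
    np.random.rand(576), np.random.rand(576), keep=64, merged=8)
```

The reciprocal-rank form is just one way to reconcile asymmetric signals; the paper's actual mechanism may weight or align them differently.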
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
token reduction
visual saliency
cross-attention
efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConsensusDrop
vision-language models
token pruning
cross-modal saliency
efficiency optimization
Dhruv Parikh
University of Southern California
Haoyang Fan
University of Southern California
Rajgopal Kannan
MIS Division, Army Research Office; Electrical Engineering, USC
Graph Learning and Analytics, Acceleration, Optimization, CPS
Viktor Prasanna
University of Southern California