ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of vision-language models caused by redundant visual tokens. Existing pruning methods, which rely on single-modality signals, struggle to balance efficiency and accuracy. To overcome this limitation, we propose a training-free, efficient pruning framework that uniquely integrates visual saliency with cross-attention signals from large language models (LLMs). Our approach employs a consensus-ranking mechanism to identify critical tokens and combines it with an encoder-guided token merging strategy to compress redundancy. This design effectively mitigates temporal misalignment and asymmetry between cross-modal signals, maximizing their complementary strengths. Evaluated on models such as LLaVA-1.5/NeXT and Video-LLaVA, our method significantly outperforms existing approaches—substantially reducing token count while preserving near-original accuracy and markedly lowering first-token latency and KV cache memory usage.

📝 Abstract
Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.
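The page gives no pseudocode for the consensus ranking, so the sketch below is an assumption-laden illustration only: it fuses the two saliency signals with reciprocal-rank fusion (one common consensus-ranking scheme, not necessarily the paper's), keeps a fixed token budget, and stands in for encoder-guided merging with simple round-robin grouping of the pruned tokens. The function name, parameters, and the `k` smoothing constant are all hypothetical.

```python
import numpy as np

def consensus_keep_and_merge(vis_sal, cross_attn, keep=64, merged=8, k=60.0):
    """Hypothetical sketch of consensus-based token pruning.

    vis_sal:    (N,) vision-encoder saliency per visual token (assumed shape)
    cross_attn: (N,) LLM cross-attention mass per visual token (assumed shape)
    Returns indices of kept tokens and groups of pruned-token indices to merge.
    """
    # Rank each signal independently (rank 0 = most salient); ranking makes
    # the fusion robust to the two signals' very different scales.
    r_vis = np.argsort(np.argsort(-np.asarray(vis_sal)))
    r_att = np.argsort(np.argsort(-np.asarray(cross_attn)))
    # Reciprocal-rank fusion: tokens ranked highly by BOTH signals win.
    score = 1.0 / (k + r_vis) + 1.0 / (k + r_att)
    order = np.argsort(-score)
    kept = np.sort(order[:keep])
    rest = order[keep:]
    # Stand-in for encoder-guided merging: split the pruned tokens into a few
    # groups whose features would be averaged into `merged` summary tokens.
    groups = [g for g in np.array_split(rest, merged)] if len(rest) else []
    return kept, [g for g in groups if len(g)]

# Toy usage with random saliency over 576 tokens (LLaVA-1.5's visual count).
kept, groups = consensus_keep_and_merge(
    np.random.rand(576), np.random.rand(576), keep=64, merged=8)
```

The reciprocal-rank form is just one way to reconcile asymmetric signals; the paper's actual mechanism may weight or align them differently.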
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
token reduction
visual saliency
cross-attention
efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

ConsensusDrop
vision-language models
token pruning
cross-modal saliency
efficiency optimization
Dhruv Parikh
University of Southern California
Haoyang Fan
University of Southern California
Rajgopal Kannan
MIS Division, Army Research Office; Electrical Engineering, USC
Graph Learning and Analytics, Acceleration, Optimization, CPS
Viktor Prasanna
University of Southern California