Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

📅 2026-04-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing visual token pruning methods for large vision-language models (LVLMs) typically rely on a single attention source, which biases pruning decisions and limits their effectiveness. This work proposes DeSAP, which introduces a decoupled similarity to model fine-grained task relevance between visual features and text tokens, and combines it with visual saliency signals for task-aware pruning. On LLaVA-1.5-7B, DeSAP retains only 11.1% of visual tokens, reducing FLOPs by 10× and accelerating prefilling by 2.3× while preserving 98.1% of the original performance, and it outperforms current state-of-the-art methods.

📝 Abstract
Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under the guidance of both task-related and visual cues, enabling robust pruning even under aggressive pruning ratios. Extensive experiments across diverse benchmarks and architectures show that DeSAP consistently outperforms SOTA methods in both accuracy and efficiency. On LLaVA-1.5-7B, DeSAP achieves a 10 times FLOPs reduction and a 2.3 times prefill speedup by retaining only 11.1% of visual tokens, while maintaining 98.1% of the original performance.
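
The scoring-and-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the decoupled similarity formula, so plain cosine similarity between each visual token and its best-matching text token stands in for it here. The function names, the min-max normalization, and the mixing weight `alpha` are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    """Rescale values to [0, 1] so the two cues are comparable."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0] * len(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def select_tokens(visual, text, saliency, keep_ratio=0.111, alpha=0.5):
    """Return sorted indices of the visual tokens to keep.

    visual:   list of visual token feature vectors
    text:     list of text token feature vectors
    saliency: one visual-attention score per visual token
    alpha:    hypothetical weight between task relevance and saliency
    """
    # Assumption: task relevance = best cosine match to any text token.
    # The paper's decoupled similarity is a different, unspecified measure.
    relevance = [max(cosine(v, t) for t in text) for v in visual]
    rel_n, sal_n = min_max(relevance), min_max(saliency)
    scores = [alpha * r + (1 - alpha) * s for r, s in zip(rel_n, sal_n)]
    k = max(1, round(keep_ratio * len(visual)))
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep original token order

if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    text = [[1.0, 0.0]]
    saliency = [0.1, 0.9, 0.5, 0.0]
    print(select_tokens(visual, text, saliency, keep_ratio=0.5))
```

Assuming the 576 visual tokens that LLaVA-1.5 produces per image, a keep ratio of 11.1% corresponds to 64 retained tokens. Setting `alpha` to 1.0 or 0.0 in this sketch reduces it to a single-cue selector, which is the kind of one-sided signal the abstract argues against.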
Problem

Research questions and friction points this paper is trying to address.

token pruning
vision-language models
attention bias
computational overhead
cross-modal relevance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Similarity
Task-Aware Pruning
Token Pruning
Vision-Language Models
Cross-Modal Relevance