ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
High-resolution visual tokens in large vision-language models (LVLMs) incur quadratic computational overhead, yet existing compression methods often neglect attention misalignment and semantic redundancy, struggling to balance efficiency and performance. This work proposes ASAP—a training-free, KV-cache-compatible visual token pruning approach that systematically identifies and addresses attention misalignment for the first time. ASAP dynamically corrects attention offsets via a bidirectional soft attention mask and eliminates semantic redundancy through a weighted soft merging mechanism, preserving patches with high information density. Evaluated on LLaVA-NeXT-7B, ASAP reduces visual token FLOPs by approximately 80% while incurring only a 0.98% accuracy drop, retaining 99.02% of the original performance.

📝 Abstract
While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Recent token-reduction strategies attempt to accelerate inference, but they exploit attention values inadequately and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate attention shift with a dynamic bidirectional soft attention mask, ensuring that genuinely informative tokens are selected rather than relying on naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance; we therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while cutting computational FLOPs by roughly 80%.
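The two-stage pipeline the abstract describes (correct attention scores with a soft mask, select top tokens, then softly merge pruned tokens into their nearest kept neighbors) can be sketched as below. This is an illustrative reconstruction, not the paper's implementation: the function name `asap_prune`, the specific positional de-biasing weight used as the "soft mask", and the cosine-similarity merge weighting are all assumptions for the sketch.

```python
# Hypothetical sketch of ASAP-style visual token pruning. The mask and
# merge formulas are illustrative stand-ins, not taken from the paper.
import numpy as np

def asap_prune(tokens, attn, keep_ratio=0.2):
    """tokens: (N, d) visual token features; attn: (N,) raw attention scores.
    Returns (merged_tokens, kept_indices)."""
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))

    # 1) Soft attention mask: de-bias the positional "attention shift"
    #    that inflates scores for some positions. Modeled here as a
    #    symmetric position-dependent weight (purely illustrative).
    pos = np.arange(n) / max(n - 1, 1)
    soft_mask = 1.0 - np.abs(pos - 0.5)          # peak weight at the center
    corrected = attn * soft_mask

    # 2) Keep the top-k tokens under the corrected scores.
    keep = np.argsort(corrected)[-k:]
    drop = np.setdiff1d(np.arange(n), keep)

    # 3) Weighted soft merging: fold each dropped token into its most
    #    similar kept token, weighted by cosine similarity, so redundant
    #    semantics are absorbed rather than discarded outright.
    merged = tokens[keep].copy()
    weights = np.ones(k)                          # each kept token starts with weight 1
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    for i in drop:
        sims = normed[keep] @ normed[i]           # cosine similarity to kept tokens
        j = int(np.argmax(sims))
        w = max(float(sims[j]), 0.0)
        merged[j] = (weights[j] * merged[j] + w * tokens[i]) / (weights[j] + w + 1e-8)
        weights[j] += w
    return merged, keep
```

A keep ratio around 0.2 mirrors the roughly 80% FLOPs reduction the paper reports, since attention cost scales with the number of visual tokens retained.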
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
token redundancy
attention shift
computational cost
visual token pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention shift
token pruning
semantic redundancy
KV-Cache-compatible
training-free