🤖 AI Summary
Large vision-language models (LVLMs) with cross-attention layers suffer from excessive KV cache growth and high inference overhead because of the large number of image tokens. This paper proposes a training-free, sparsity-driven method for dynamic visual token pruning. Unlike existing approaches that exploit self-attention sparsity, this work explicitly models and leverages the intrinsic sparsity of cross-attention maps to compress visual features across cross-attention layers. The method requires no fine-tuning or additional training and is plug-and-play on architectures such as LLaMA-3.2-Vision. Experiments show that reducing visual tokens by 50% yields significant reductions in GPU memory consumption and inference latency while maintaining performance on multimodal understanding benchmarks (e.g., MMBench, OCRBench). The core contribution is the first explicit use of cross-attention sparsity for visual token compression, extending beyond the prevailing self-attention-centric paradigm and establishing a new direction for efficient LVLM inference.
📝 Abstract
Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while achieving benchmark parity.
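The core mechanism can be sketched as follows: score each image token by the cross-attention mass it receives, then keep only the top fraction in the KV cache. This is a minimal NumPy illustration under assumed tensor shapes and a hypothetical `prune_visual_tokens` helper, not the authors' released implementation:

```python
import numpy as np

def prune_visual_tokens(attn, keys, values, keep_ratio=0.5):
    """Sparsity-driven pruning sketch (hypothetical, not the paper's code).

    attn:  (heads, num_text, num_image) cross-attention map
    keys:  (num_image, dim) per-image-token key cache
    values:(num_image, dim) per-image-token value cache
    Returns the pruned keys/values and the kept token indices.
    """
    # Aggregate the attention mass each image token receives,
    # averaged over heads and text queries.
    scores = attn.mean(axis=(0, 1))              # (num_image,)
    k = max(1, int(len(scores) * keep_ratio))
    kept = np.sort(np.argsort(scores)[-k:])      # top-k, original order
    return keys[kept], values[kept], kept

# Toy example: 4 heads, 3 text queries, 8 image tokens, dim 16.
rng = np.random.default_rng(0)
attn = rng.random((4, 3, 8))
attn /= attn.sum(axis=-1, keepdims=True)         # normalize like softmax
keys = rng.standard_normal((8, 16))
values = rng.standard_normal((8, 16))
k2, v2, idx = prune_visual_tokens(attn, keys, values)
print(k2.shape)  # KV entries halved: (4, 16)
```

With `keep_ratio=0.5`, the cross-attention KV cache for image tokens is halved, which is where the memory and latency savings described above come from; in practice the scoring would run inside the model's cross-attention layers rather than on a standalone map.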