🤖 AI Summary
Large vision-language models (LVLMs) with cross-attention layers suffer from excessive KV cache growth and high inference overhead because of the large number of image tokens. This paper proposes a training-free, sparsity-driven method for dynamic visual token pruning. Unlike existing approaches that exploit self-attention sparsity, this work explicitly models and leverages the intrinsic sparsity of cross-attention maps to compress visual features across cross-attention layers. The method requires no fine-tuning or additional training and is plug-and-play on architectures such as LLaMA-3.2-Vision. Experiments show that reducing visual tokens by 50% yields significant reductions in GPU memory consumption and inference latency while maintaining performance on multimodal understanding benchmarks (e.g., MMBench, OCRBench). The core contribution is the first explicit use of cross-attention sparsity for visual token compression, extending beyond the prevailing self-attention-centric paradigm and establishing a new direction for efficient LVLM inference.
📝 Abstract
Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while achieving benchmark parity.
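The core mechanism can be sketched as follows: score each image token by the cross-attention mass it receives, then keep only the top fraction in the KV cache. This is a minimal NumPy illustration under assumed tensor shapes and a hypothetical `prune_visual_tokens` helper, not the authors' released implementation:

```python
import numpy as np

def prune_visual_tokens(attn, keys, values, keep_ratio=0.5):
    """Sparsity-driven pruning sketch (hypothetical, not the paper's code).

    attn:  (heads, num_text, num_image) cross-attention map
    keys:  (num_image, dim) per-image-token key cache
    values:(num_image, dim) per-image-token value cache
    Returns the pruned keys/values and the kept token indices.
    """
    # Aggregate the attention mass each image token receives,
    # averaged over heads and text queries.
    scores = attn.mean(axis=(0, 1))              # (num_image,)
    k = max(1, int(len(scores) * keep_ratio))
    kept = np.sort(np.argsort(scores)[-k:])      # top-k, original order
    return keys[kept], values[kept], kept

# Toy example: 4 heads, 3 text queries, 8 image tokens, dim 16.
rng = np.random.default_rng(0)
attn = rng.random((4, 3, 8))
attn /= attn.sum(axis=-1, keepdims=True)         # normalize like softmax
keys = rng.standard_normal((8, 16))
values = rng.standard_normal((8, 16))
k2, v2, idx = prune_visual_tokens(attn, keys, values)
print(k2.shape)  # KV entries halved: (4, 16)
```

With `keep_ratio=0.5`, the cross-attention KV cache for image tokens is halved, which is where the memory and latency savings described above come from; in practice the scoring would run inside the model's cross-attention layers rather than on a standalone map.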