Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing retrieval-head methods rely on a text-copying criterion, which does not apply when the evidence is an image, so they cannot assess how long-context vision-language models use visual evidence. This work proposes a multimodal retrieval head detection method that scores the attention from question tokens to textual or visual evidence and uses those scores to identify the heads responsible for cross-modal evidence localization. It shows that such heads are sparse, intrinsic, and causally important, that they are partly shared across modalities yet vary with context length and haystack modality, and that they can rank visually rich documents without further training. Only 4.4–10.2% of heads account for 50% of the positive retrieval-score mass; masking the top 5% of selected heads drops MMLongBench-Doc performance from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, far more than random-head masking; and on MMDocIR, selected-head scoring with Qwen3-VL-8B improves page-retrieval Recall@1 by 7.7 macro points over the strongest reported baseline.
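
The sparsity figure above can be made concrete with a small sketch. The summary does not give the paper's exact scoring formula, so the code below assumes a simple stand-in: a head's score is the mean attention mass that question tokens place on evidence tokens. The function names `head_scores` and `fraction_for_mass` and the nested-list attention layout are hypothetical, chosen only for illustration.

```python
def head_scores(attn, question_idx, evidence_idx):
    """Score each head by attention from question tokens to evidence tokens.

    attn[layer][head][query][key] holds attention weights (assumed layout).
    Returns {(layer, head): mean attention mass on the evidence positions}.
    This raw-mass score is an assumption, not the paper's stated formula.
    """
    scores = {}
    for l, layer in enumerate(attn):
        for h, head in enumerate(layer):
            # Attention mass each question token sends to the evidence span.
            mass = [sum(head[q][k] for k in evidence_idx) for q in question_idx]
            scores[(l, h)] = sum(mass) / len(mass)
    return scores


def fraction_for_mass(scores, target=0.5):
    """Smallest fraction of all heads whose positive scores reach `target`
    of the total positive score mass."""
    positive = sorted((s for s in scores.values() if s > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0.0
    running = 0.0
    for n, s in enumerate(positive, start=1):
        running += s
        if running >= target * total:
            return n / len(scores)
    return len(positive) / len(scores)


if __name__ == "__main__":
    # Toy input: 1 layer, 2 heads, 2 tokens; token 1 is the evidence.
    toy_attn = [[[[0.1, 0.9], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]]
    toy_scores = head_scores(toy_attn, question_idx=[0, 1], evidence_idx=[1])
    print(toy_scores, fraction_for_mass(toy_scores))
```

Under this reading, a reported value of 4.4–10.2% means `fraction_for_mass` returns between 0.044 and 0.102 for the models studied. Only positive scores are counted because the abstract speaks of positive retrieval-score mass, which suggests the actual score can be negative for some heads.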

📝 Abstract

Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
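
The MMDocIR numbers are reported as macro/micro Recall@1 pairs. As a rough illustration of how the two averages differ, the sketch below uses the standard definition of Recall@k (the share of gold items found in the top k) and assumes that macro averaging is taken over some grouping of queries, such as document domain. That grouping key, the record layout, and the function name `recall_at_1` are assumptions, not details taken from the paper.

```python
from collections import defaultdict


def recall_at_1(records):
    """Compute (macro, micro) Recall@1.

    records: iterable of (group, ranked_ids, gold_ids), where ranked_ids is
    a list ordered best-first and gold_ids is a non-empty set.
    Micro averages over all queries; macro averages per-group means.
    """
    per_group = defaultdict(list)
    for group, ranked, gold in records:
        top1 = set(ranked[:1])
        # Share of gold items recovered by the single top-ranked item.
        per_group[group].append(len(top1 & set(gold)) / len(gold))
    all_values = [v for values in per_group.values() for v in values]
    micro = sum(all_values) / len(all_values)
    macro = sum(sum(v) / len(v) for v in per_group.values()) / len(per_group)
    return macro, micro


if __name__ == "__main__":
    # Hypothetical toy data: three queries in group "a", one in group "b".
    toy = [
        ("a", [1, 2], {1}),
        ("a", [3, 1], {1}),
        ("a", [2], {2}),
        ("b", [5], {4}),
    ]
    print(recall_at_1(toy))
```

Macro and micro diverge when groups differ in size or difficulty, which is why the abstract reports both (7.7/7.4 for page retrieval, 6.3/6.8 for layout retrieval).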

Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval

vision-language models

long-context modeling

retrieval heads

visual evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval heads

long-context vision-language models

attention sparsity