Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

📅 2026-05-26
🤖 AI Summary
Existing retrieval-head methods cannot readily assess the role of visual evidence in long-context vision-language models, because their text-copying criterion does not apply when the evidence is an image. This work proposes a multimodal retrieval head detection method that scores the attention question tokens pay to textual or visual evidence, and uses it to identify the heads that matter for locating evidence across modalities. The analysis shows that these heads are sparse, intrinsic, and causally important; that they are partly shared across modalities yet dynamic within each, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change; and that they can rank visually rich documents without further training. Only 4.4–10.2% of heads account for 50% of the positive retrieval-score mass. Masking the top 5% of selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, whereas masking random heads is far less damaging. On MMDocIR, selected-head scoring with Qwen3-VL-8B improves Recall@1 over the strongest reported baseline by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval.
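The sparsity figure above can be read as "the smallest share of heads whose positive scores add up to half of all positive score". A minimal sketch of that statistic follows, together with a top-fraction selection of the kind a masking experiment would need. The function names, the per-head score list, and the tie-breaking are illustrative assumptions, not the paper's implementation.

```python
import math


def fraction_of_heads_for_mass(scores, mass=0.5):
    """Smallest fraction of all heads whose positive scores reach `mass`
    of the total positive score. `scores` holds one float per head
    (hypothetical input; how the paper computes scores is not shown here)."""
    positive = sorted((s for s in scores if s > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0.0  # no positive score mass at all (also covers empty input)
    running = 0.0
    for count, s in enumerate(positive, start=1):
        running += s
        if running >= mass * total:
            return count / len(scores)
    return len(positive) / len(scores)


def top_fraction_heads(scores, fraction=0.05):
    """Indices of the highest-scoring heads, keeping at least one head.
    Assumption: the count is rounded up; the paper may round differently."""
    k = max(1, math.ceil(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    toy_scores = [4.0, 2.0, 1.0, 1.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0]
    print(fraction_of_heads_for_mass(toy_scores))
    print(top_fraction_heads(toy_scores, fraction=0.2))
```

In the toy example, one head out of ten already holds half of the positive mass. Negative and zero scores count toward the denominator but not toward the mass, which matches the "positive retrieval-score mass" wording.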
📝 Abstract
Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
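To make the detection idea concrete, here is a rough sketch of scoring one attention head by how much attention question tokens place on evidence tokens, and of ranking candidate pages with a set of selected heads. The abstract does not give the exact formula, so the mean-over-question-tokens aggregation, the summation across heads, and all names (`head_evidence_score`, `rank_pages`, the nested-list attention layout) are assumptions for illustration only.

```python
def head_evidence_score(attn, question_idx, evidence_idx):
    """Mean attention mass that question tokens place on evidence tokens
    for a single head. `attn[q][k]` is the attention weight from query
    position q to key position k. Evidence positions may be text tokens
    or image patch tokens; the computation is the same under this sketch."""
    evidence = set(evidence_idx)
    per_question = [sum(attn[q][k] for k in evidence) for q in question_idx]
    return sum(per_question) / len(per_question)


def rank_pages(page_scores_by_head, selected_heads):
    """Rank candidate pages by their score summed over the selected heads.
    `page_scores_by_head[h][p]` is head h's score for page p (for example,
    the value of head_evidence_score with page p's tokens as evidence).
    Returns page indices, best first."""
    n_pages = len(page_scores_by_head[selected_heads[0]])
    totals = [
        sum(page_scores_by_head[h][p] for h in selected_heads)
        for p in range(n_pages)
    ]
    return sorted(range(n_pages), key=lambda p: totals[p], reverse=True)


if __name__ == "__main__":
    # Four positions: 0-1 are context, 2-3 are question tokens.
    attn = [
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.25, 0.5, 0.25, 0.0],
        [0.5, 0.25, 0.0, 0.25],
    ]
    print(head_evidence_score(attn, question_idx=[2, 3], evidence_idx=[1]))

    page_scores = {0: [0.25, 0.5, 0.0], 1: [0.5, 0.125, 0.0], 2: [0.0, 0.0, 1.0]}
    print(rank_pages(page_scores, selected_heads=[0, 1]))
```

Under these assumptions, Recall@1 is the share of queries for which the first index returned by `rank_pages` is the ground-truth page. Restricting the sum to selected heads is what keeps a strong but irrelevant head (head 2 in the toy data) from dominating the ranking.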
Problem

Research questions and friction points this paper is trying to address.

multimodal retrieval
vision-language models
long-context modeling
retrieval heads
visual evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal retrieval heads
long-context vision-language models
attention sparsity
cross-modal attention
document retrieval