When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

📅 2026-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies and formalizes a critical failure mode in multimodal retrieval-augmented generation (RAG) systems: attention distraction (AD), in which retrieved textual content excessively suppresses visual attention, causing large vision-language models (LVLMs) to overlook crucial image evidence and produce incorrect answers. To address this issue without additional training, the authors propose MAD-RAG, an inference-time method that decouples visual grounding from contextual fusion via dual-question prompting and harmonizes cross-modal attention through an attention-blending mechanism. Evaluated on the OK-VQA, E-VQA, and InfoSeek benchmarks, MAD-RAG achieves absolute accuracy improvements of up to 4.76%, 9.20%, and 6.18%, respectively, rectifies up to 74.68% of previously failed cases, introduces negligible computational overhead, and remains compatible across diverse LVLM architectures.

📝 Abstract
While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies have overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or containing the correct answer), the retrieved text globally suppresses visual attention, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.
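The attention-mixing idea in the abstract can be sketched in miniature: blend the attention a model places on image tokens when conditioned on the retrieved context with the attention from a context-free pass on the same question, so that visual grounding is not fully overridden. This is an illustrative sketch only; the function name `blend_image_attention`, the mixing coefficient `alpha`, and the simple convex-combination rule are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def blend_image_attention(attn_rag, attn_visual, alpha=0.5):
    """Blend two attention distributions over image tokens (hypothetical rule).

    attn_rag: attention on image tokens from the pass that sees the retrieved
              context, where distraction may pull attention off-target.
    attn_visual: attention from a context-free pass on the same question,
                 used as a visual-grounding anchor.
    alpha: mixing coefficient (an assumed hyperparameter).
    """
    mixed = (1 - alpha) * attn_rag + alpha * attn_visual
    # Renormalize so the blended weights sum to 1 per query position.
    return mixed / mixed.sum(axis=-1, keepdims=True)

# Toy example: 1 query position, 6 image tokens.
rag = softmax(np.array([[2.0, 0.1, 0.1, 0.1, 0.1, 0.1]]))  # attention collapsed onto one token
vis = softmax(np.array([[0.1, 0.1, 2.0, 2.0, 0.1, 0.1]]))  # question-relevant regions
blended = blend_image_attention(rag, vis, alpha=0.5)
```

In this toy case, the blend restores mass to the question-relevant tokens (indices 2 and 3) that the context-conditioned pass had drained, which is the qualitative behavior the paper's attention mixing is meant to achieve.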
Problem

Research questions and friction points this paper is trying to address.

Retrieval-Augmented Generation
Large Vision-Language Models
Attention Distraction
Visual Question Answering
RAG failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention Distraction
Retrieval-Augmented Generation
Large Vision-Language Models
Visual Grounding
Training-Free Intervention