🤖 AI Summary
This work identifies and formalizes a critical failure mode in multimodal retrieval-augmented generation (RAG) systems: attention distraction (AD), wherein retrieved textual content excessively suppresses visual attention, causing large vision-language models (LVLMs) to overlook crucial image evidence and produce incorrect answers. To address this issue without requiring additional training, the authors propose MAD-RAG, a novel inference-time method that decouples visual grounding from contextual fusion via dual-question prompting and harmonizes cross-modal attention through an attention blending mechanism. Evaluated on OK-VQA, E-VQA, and InfoSeek benchmarks, MAD-RAG achieves absolute accuracy improvements of 4.76%, 9.20%, and 6.18%, respectively, successfully rectifying 74.68% of previously failed cases while introducing negligible computational overhead and demonstrating broad compatibility across diverse LVLM architectures.
📝 Abstract
While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context and proposes to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or even containing the correct answer), the retrieved text globally suppresses visual attention, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.
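The attention-mixing idea can be sketched as a convex combination of image-token attention from two forward passes: a grounding pass (image plus question, no retrieved text) and a RAG pass (image plus question plus retrieved context). Note this is a minimal illustrative sketch: the function name, the single coefficient `alpha`, and the two-pass setup are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def blend_image_attention(attn_rag, attn_visual, alpha=0.5):
    """Blend attention weights over image tokens from two passes.

    attn_rag    -- attention over image tokens when retrieved text is in
                   the prompt (visual attention may be suppressed/shifted).
    attn_visual -- attention from an image-only "grounding" pass.
    alpha       -- mixing coefficient (illustrative free parameter).
    """
    blended = alpha * attn_visual + (1.0 - alpha) * attn_rag
    # Renormalize so the image-token attention remains a distribution.
    return blended / blended.sum(axis=-1, keepdims=True)

# Toy example: the RAG pass spreads attention away from the relevant
# region, while the grounding pass concentrates it on image token 2.
attn_rag = np.array([0.3, 0.3, 0.2, 0.2])
attn_visual = np.array([0.05, 0.05, 0.85, 0.05])
mixed = blend_image_attention(attn_rag, attn_visual, alpha=0.5)
# The blended weights again peak on the question-relevant token 2.
```

The renormalization step keeps the mixed weights a valid attention distribution, so the intervention can be applied at inference time without retraining.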