The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses “recorruption” in multimodal retrieval-augmented generation: adding external textual evidence, even when it is accurate, can cause a model to abandon an initially correct prediction. The study formalizes the phenomenon and traces it to two attention-level causes: suppressed visual attention and a preference for tokens at boundary positions. It also identifies an “Illusion of Success” effect, in which seemingly correct RAG answers arise only because the model's positional copying bias happens to coincide with the location of the ground truth. To mitigate these problems, the authors propose BAIR (Bottleneck Attention Intervention for Recovery), a training-free, inference-time framework that diagnoses attention matrices and combines visual saliency restoration with position-aware penalties on textual distractors. Evaluated on medical factuality, social fairness, and geospatial benchmarks, BAIR restores multimodal grounding and improves diagnostic reliability without retraining or fine-tuning.

📝 Abstract

While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate "oracle" context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model's textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.
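
The abstract names two diagnostics, visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and an intervention that restores visual saliency while penalizing textual distractors by position, but it does not give their formulas. The sketch below is only an illustration under stated assumptions, not the authors' method: it assumes mass is the share of attention on visual tokens, sharpness is one minus the normalized entropy over visual tokens, and the intervention is a simple multiplicative reweighting. The function names and the `boost` and `edge_penalty` parameters are hypothetical.

```python
import math


def visual_mass(attn, visual_idx):
    """Share of total attention on visual tokens (assumed definition of M_vis)."""
    total = sum(attn)
    return sum(attn[i] for i in visual_idx) / total


def visual_sharpness(attn, visual_idx):
    """1 - normalized entropy over visual tokens (assumed proxy for S_vis)."""
    weights = [attn[i] for i in visual_idx]
    total = sum(weights)
    if total == 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def intervene(attn, visual_idx, text_idx, boost=2.0, edge_penalty=0.5):
    """Hypothetical reweighting: boost visual tokens, damp text tokens
    near the boundaries of the retrieved span, then renormalize."""
    out = list(attn)
    for i in visual_idx:
        out[i] *= boost
    n = len(text_idx)
    for rank, i in enumerate(text_idx):
        # 0.0 at either end of the text span, larger toward the middle.
        centrality = min(rank, n - 1 - rank) / max((n - 1) / 2, 1)
        out[i] *= edge_penalty + (1 - edge_penalty) * centrality
    total = sum(out)
    return [w / total for w in out]


if __name__ == "__main__":
    # Toy attention row: 2 visual tokens, then 4 retrieved-text tokens
    # with most of the weight on the first and last text positions.
    attn = [0.05, 0.05, 0.40, 0.05, 0.05, 0.40]
    visual, text = [0, 1], [2, 3, 4, 5]
    adjusted = intervene(attn, visual, text)
    print("mass before:", visual_mass(attn, visual))
    print("mass after: ", visual_mass(adjusted, visual))
```

In this toy row, most attention sits on the first and last text tokens, mimicking the boundary preference the abstract describes. The reweighting raises the visual share and shrinks the boundary weights; a real intervention would act on a model's attention tensors per layer and head, which this sketch does not attempt.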

Problem

Research questions and friction points this paper is trying to address.

recorruption

multimodal retrieval-augmented generation

textual bias

visual blindness

positional bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

recorruption

visual blindness

positional bias