Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

📅 2026-05-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses object hallucination in multimodal large language models (MLLMs) and the "visual blur" that accompanies it, both of which the authors attribute to attention distraction. The authors, who describe this as a first, map the human experience of divided attention onto two model behaviors: spatial inconsistency across attention heads and temporal decay of attention to image tokens during decoding. They then relate both behaviors to hallucination. They propose a plug-and-play correction method that requires no additional training and combines cross-head attention enhancement with dynamic historical attention reinforcement. Evaluated across multiple benchmarks and model architectures, the method is reported to significantly reduce hallucination rates, improve the accuracy of visual descriptions, and enhance model robustness. The study also provides a theoretical analysis of how attention dispersion affects model generalization.
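The two symptoms named in the summary, disagreement between attention heads and fading attention to image tokens, can be made concrete with simple diagnostics. The sketch below is illustrative only: the function names, the use of Jensen-Shannon divergence for head disagreement, and the least-squares slope for temporal fading are all assumptions chosen for clarity, not the paper's definitions. Each attention row is assumed to be a list of non-negative weights over sequence positions that sums to 1.

```python
import math


def _normalize(values):
    """Rescale non-negative weights to sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def _kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def head_disagreement(head_rows, image_idx):
    """Mean pairwise JS divergence between heads, restricted to image tokens.

    Assumed proxy for 'spatial inconsistency': 0 means all heads look at the
    same image regions, 1 means they look at disjoint regions.
    """
    dists = [_normalize([row[i] for i in image_idx]) for row in head_rows]
    pairs = [
        js_divergence(dists[a], dists[b])
        for a in range(len(dists))
        for b in range(a + 1, len(dists))
    ]
    return sum(pairs) / len(pairs)


def image_attention_mass(row, image_idx):
    """Fraction of one attention row that lands on image tokens."""
    return sum(row[i] for i in image_idx)


def fading_slope(masses):
    """Least-squares slope of image-attention mass over decoding steps.

    Assumed proxy for 'temporal decay': a negative slope means attention to
    the image shrinks as generation proceeds. Needs at least two steps.
    """
    n = len(masses)
    mean_t = (n - 1) / 2
    mean_m = sum(masses) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in enumerate(masses))
    var = sum((t - mean_t) ** 2 for t in range(n))
    return cov / var
```

In practice the rows would come from a model's per-head attention weights at each decoding step. A high `head_disagreement` together with a negative `fading_slope` would correspond to the pattern the summary associates with hallucination.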

📝 Abstract

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon. Humans under divided focus experience degraded visual clarity and produce inaccurate descriptions; in models, the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights showing that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.
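The abstract names AFIP's two components but not their formulas. To give a feel for what a training-free correction of this kind could look like, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the blending rule, the deficit-based boost, the exponential moving average, and the parameters `alpha` and `beta` are all hypothetical. Attention rows are assumed to be lists of non-negative weights over sequence positions that sum to 1.

```python
def renormalize(row):
    """Rescale a row of attention weights so it sums to 1 again."""
    total = sum(row)
    return [v / total for v in row]


def enrich_across_heads(head_rows, image_idx, alpha=0.5):
    """Hypothetical cross-head enrichment.

    Pulls each head's image-token weights toward the cross-head mean, so that
    heads agree more on which image regions matter. alpha=0 leaves the heads
    unchanged; alpha=1 replaces image-token weights with the mean.
    """
    n_heads = len(head_rows)
    mean_img = {i: sum(h[i] for h in head_rows) / n_heads for i in image_idx}
    enriched = []
    for row in head_rows:
        new_row = list(row)
        for i in image_idx:
            new_row[i] = (1 - alpha) * row[i] + alpha * mean_img[i]
        enriched.append(renormalize(new_row))
    return enriched


def reinforce_with_history(row, history, image_idx, beta=0.9):
    """Hypothetical dynamic historical enhancement for one decoding step.

    `history` maps image-token index -> running average of past attention.
    If the current step attends to the image less than the history did, the
    deficit is added back, distributed in proportion to the history.
    Returns the corrected row and the updated history.
    """
    current_mass = sum(row[i] for i in image_idx)
    past_mass = sum(history[i] for i in image_idx)
    new_row = list(row)
    deficit = past_mass - current_mass
    if deficit > 0:
        for i in image_idx:
            new_row[i] = row[i] + deficit * history[i] / past_mass
        new_row = renormalize(new_row)
    # Update the running average with the uncorrected weights so the boost
    # does not feed back into its own reference signal.
    new_history = {
        i: beta * history[i] + (1 - beta) * row[i] for i in image_idx
    }
    return new_row, new_history
```

Both functions operate only on attention weights at inference time, which matches the abstract's statement that no additional training is needed. How and where the actual method intervenes in the model is specified in the paper, not here.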

Problem

Research questions and friction points this paper is trying to address.

object hallucinations

attention distraction

visual perception

multimodal large language models

spatial inconsistency

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention distraction

visual hallucination

multimodal large language models