When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses visual hallucination in multimodal large language models (MLLMs), where generated responses contradict image content or mention nonexistent objects. Such errors are not always caused by a lack of visual attention: a model can place substantial attention mass on image tokens and still drift toward an incorrect answer. The study shows that the high-frequency structure of visual attention maps, quantified by layer-wise Laplacian energy, identifies both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this insight, the authors propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding method that selects informative layers via Laplacian energy and remaps next-token logits in closed form. On hallucination-focused and general multimodal benchmarks, LaSCD consistently reduces hallucination while preserving general capabilities.
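To make the idea of remapping next-token logits concrete, the sketch below shows the generic contrastive form used in layer-contrastive decoding: amplify the final-layer distribution and subtract a contrast-layer distribution. This is not LaSCD's closed-form remapping, which the summary does not specify; the function names, the `alpha` weight, and the toy logits are all hypothetical.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over a 1-D logit vector."""
    x = np.asarray(x, dtype=np.float64)
    x = x - x.max()
    return x - np.log(np.exp(x).sum())


def contrastive_logits(final_logits, contrast_logits, alpha=1.0):
    """Generic contrastive remap (an assumption, not the paper's formula).

    Tokens that the contrast layer prefers more strongly than the final
    layer are penalised; alpha = 0 leaves the final-layer ranking unchanged.
    """
    final_lp = log_softmax(final_logits)
    contrast_lp = log_softmax(contrast_logits)
    return (1.0 + alpha) * final_lp - alpha * contrast_lp


# Toy example: token 0 is over-preferred by the hypothetical contrast layer.
final = [2.0, 1.0, 0.0]
contrast = [3.0, 0.0, 0.0]
plain_choice = int(np.argmax(contrastive_logits(final, contrast, alpha=0.0)))
remapped_choice = int(np.argmax(contrastive_logits(final, contrast, alpha=1.0)))
```

In this toy case the contrast shifts the top-ranked token away from the one the contrast layer over-prefers. Which layer serves as the contrast, and how strongly it is weighted, are exactly the choices that a layer-selection criterion such as Laplacian energy would have to supply.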
📝 Abstract
Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.
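As a rough illustration of measuring the high-frequency structure of an attention map, the sketch below computes a Laplacian energy per layer. It assumes the attention over image tokens has been reshaped into a 2-D patch grid and uses a 4-neighbour discrete Laplacian with replicated borders; the paper's exact definition may differ, and `laplacian_energy` and `layerwise_energy` are hypothetical names.

```python
import numpy as np


def laplacian_energy(attn_map):
    """Sum of squared 4-neighbour discrete Laplacian responses.

    attn_map: 2-D array of attention weights over an image patch grid
    (an assumed layout). Higher values mean more high-frequency structure.
    """
    a = np.asarray(attn_map, dtype=np.float64)
    # Replicate border values so every cell has four neighbours.
    p = np.pad(a, 1, mode="edge")
    lap = (
        p[:-2, 1:-1]    # neighbour above
        + p[2:, 1:-1]   # neighbour below
        + p[1:-1, :-2]  # neighbour to the left
        + p[1:-1, 2:]   # neighbour to the right
        - 4.0 * a
    )
    return float(np.sum(lap ** 2))


def layerwise_energy(attn_maps):
    """Laplacian energy for each layer's attention map, in layer order."""
    return np.array([laplacian_energy(m) for m in attn_maps])


# Toy maps: a smooth horizontal ramp versus a checkerboard.
smooth = np.tile(np.linspace(0.0, 1.0, 6), (6, 1))
checker = (np.indices((6, 6)).sum(axis=0) % 2).astype(np.float64)
energies = layerwise_energy([smooth, checker])
```

A smooth map yields low energy and a rapidly alternating map yields high energy, so a per-layer profile of this quantity could in principle flag the layers whose attention structure changes sharply.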
Problem

Research questions and friction points this paper is trying to address.

visual hallucination
multimodal large language models
visual attention
grounded question answering
image content contradiction
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual hallucination
visual attention structure
Laplacian energy
contrastive decoding
multimodal LLMs