Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a key source of object hallucination in multimodal large language models: during generation, deep-layer attention often drifts away from authentic visual inputs and regresses toward noise introduced in early layers. The study is the first to trace hallucination to this regression and shows that visual anchors captured in intermediate layers are crucial for reliable generation. Building on this insight, the authors propose Cross-Layer Visual Anchoring (CLVA), a training-free method that reinforces features from critical intermediate layers while suppressing noise propagated from earlier stages, thereby steering attention back toward the correct visual regions. CLVA consistently mitigates hallucination across diverse model architectures and benchmarks, achieving superior performance without additional computational or memory overhead.

📝 Abstract
Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA (Cross-Layer Visual Anchors), a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computation time or GPU memory.
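The core intervention the abstract describes, reinforcing a mid-layer visual anchor while subtracting early-layer noise from the deep-layer representation, can be sketched as a simple residual blend. This is a minimal illustration, not the paper's actual update rule: the function name `cross_layer_anchor`, the choice of anchor/noise layers, and the weights `alpha` and `beta` are all hypothetical.

```python
import numpy as np

def cross_layer_anchor(hidden_states, anchor_layer, noise_layer,
                       alpha=0.5, beta=0.2):
    """Hypothetical sketch of cross-layer visual anchoring.

    Reinforces the mid-layer 'anchor' features and subtracts
    early-layer noise from the final-layer visual representation.
    alpha/beta are illustrative weights; the paper's exact
    formulation may differ.
    """
    h_final = hidden_states[-1]            # deep-layer visual features
    h_anchor = hidden_states[anchor_layer] # critical mid-layer anchor
    h_noise = hidden_states[noise_layer]   # early-layer noise source
    return h_final + alpha * h_anchor - beta * h_noise

# Toy example: 4 layers, 3 visual tokens, 2 feature dims.
rng = np.random.default_rng(0)
states = [rng.standard_normal((3, 2)) for _ in range(4)]
out = cross_layer_anchor(states, anchor_layer=2, noise_layer=0)
```

Because the adjustment is a fixed linear combination of already-computed hidden states, it adds no trainable parameters and negligible compute, consistent with the training-free, low-overhead claim.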
Problem

Research questions and friction points this paper is trying to address.

object hallucination
multimodal large language models
visual attention drift
visual anchors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Visual Anchors
Multimodal Large Language Models
Object Hallucination
Attention Drift
Training-Free Method
Chengxu Yang
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China

Jingling Yuan
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China

Chuang Hu
School of Computer Science, Wuhan University, China

Jiawei Jiang
Wuhan University
Machine Learning System · Federated Learning · Graph Learning