Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors

📅 2026-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a key source of object hallucination in multimodal large language models: during generation, deep-layer attention often drifts away from authentic visual inputs and regresses toward noise introduced in early layers. The study is the first to trace hallucination to this regression and shows that visual anchors captured in intermediate layers are crucial for reliable generation. Building on this insight, the authors propose Cross-Layer Visual Anchoring (CLVA), a training-free method that reinforces features from critical intermediate layers while suppressing noise propagated from earlier stages, thereby steering attention back toward the correct visual regions. CLVA consistently mitigates hallucination across diverse model architectures and benchmarks, achieving superior performance without additional computational or memory overhead.

📝 Abstract
Multimodal Large Language Models often suffer from object hallucination. While existing research utilizes attention enhancement and visual retracing, we find these works lack sufficient interpretability regarding attention drift in final model stages. In this paper, we investigate the layer-wise evolution of visual features and discover that hallucination stems from deep-layer attention regressing toward initial visual noise from early layers. We observe that output reliability depends on acquiring visual anchors at intermediate layers rather than final layers. Based on these insights, we propose CLVA (Cross-Layer Visual Anchors), a training-free method that reinforces critical mid-layer features while suppressing regressive noise. This approach effectively pulls deep-layer attention back to correct visual regions by utilizing essential anchors captured from attention dynamics. We evaluate our method across diverse architectures and benchmarks, demonstrating outstanding performance without a significant increase in computation time or GPU memory.
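The core intervention the abstract describes, reinforcing a mid-layer visual anchor while subtracting early-layer noise from the deep-layer representation, can be sketched as a simple residual blend. This is a minimal illustration, not the paper's actual update rule: the function name `cross_layer_anchor`, the choice of anchor/noise layers, and the weights `alpha` and `beta` are all hypothetical.

```python
import numpy as np

def cross_layer_anchor(hidden_states, anchor_layer, noise_layer,
                       alpha=0.5, beta=0.2):
    """Hypothetical sketch of cross-layer visual anchoring.

    Reinforces the mid-layer 'anchor' features and subtracts
    early-layer noise from the final-layer visual representation.
    alpha/beta are illustrative weights; the paper's exact
    formulation may differ.
    """
    h_final = hidden_states[-1]            # deep-layer visual features
    h_anchor = hidden_states[anchor_layer] # critical mid-layer anchor
    h_noise = hidden_states[noise_layer]   # early-layer noise source
    return h_final + alpha * h_anchor - beta * h_noise

# Toy example: 4 layers, 3 visual tokens, 2 feature dims.
rng = np.random.default_rng(0)
states = [rng.standard_normal((3, 2)) for _ in range(4)]
out = cross_layer_anchor(states, anchor_layer=2, noise_layer=0)
```

Because the adjustment is a fixed linear combination of already-computed hidden states, it adds no trainable parameters and negligible compute, consistent with the training-free, low-overhead claim.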
Problem

Research questions and friction points this paper is trying to address.

object hallucination
multimodal large language models
visual attention drift
visual anchors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Visual Anchors
Multimodal Large Language Models
Object Hallucination
Attention Drift
Training-Free Method
Chengxu Yang
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China

Jingling Yuan
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China

Chuang Hu
School of Computer Science, Wuhan University, China

Jiawei Jiang
Wuhan University
Machine Learning System · Federated Learning · Graph Learning