Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies visual attention inertia—the tendency of multimodal large language models to persistently attend to previously fixated visual regions—as a key cause of cognitive hallucinations during generation, particularly in tasks requiring relational reasoning among objects. To address this, the authors propose Inertia-aware Visual Excitation (IVE), a training-free, plug-and-play method that dynamically analyzes attention trajectories to detect visual tokens deviating from historical attention trends. IVE applies targeted excitation to disrupt attentional inertia while simultaneously suppressing excessive local focus. Evaluated across multiple state-of-the-art multimodal large language models and hallucination benchmarks, IVE consistently and significantly mitigates cognitive hallucinations, with especially pronounced improvements on tasks demanding compositional and relational understanding.
📝 Abstract
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
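The mechanism the abstract describes — selecting visual tokens whose attention is rising above their historical trend, exciting them, and penalizing tokens whose attention stays concentrated — can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation; the function name, EMA trend, and the `excite`/`penalty` multipliers are all assumptions.

```python
def ive_rescale(attn_history, alpha=0.9, excite=1.5, penalty=0.5):
    """Hypothetical inertia-aware rescaling of the latest attention step.

    attn_history: list of per-step attention distributions over N visual
                  tokens (each a list of N floats); the last entry is the
                  current decoding step.
    alpha:        EMA decay used to build the historical attention trend.
    excite:       boost for tokens emerging above their trend.
    penalty:      damping for tokens at/below trend (inertial behavior).
    """
    # Build an exponential moving average over the past steps only.
    trend = list(attn_history[0])
    for step in attn_history[1:-1]:
        trend = [alpha * t + (1 - alpha) * s for t, s in zip(trend, step)]

    # Excite tokens rising above trend; damp the rest, then renormalize.
    current = attn_history[-1]
    scaled = [c * excite if c > t else c * penalty
              for c, t in zip(current, trend)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Under this reading, a token whose attention jumps relative to its history (e.g., a newly relevant object in a relational question) gets amplified, while regions the model has fixated on since early decoding are suppressed — breaking the "body at rest" pattern.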
Problem

Research questions and friction points this paper is trying to address.

visual inertia
cognitive hallucination
multimodal large language models
relational inference
attention dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual inertia
cognitive hallucination
attention dynamics
multimodal LLMs
training-free mitigation
🔎 Similar Papers
No similar papers found.
Boyang Gong
Tsinghua University, Beijing, China
Yu Zheng
MIT | Tsinghua University
Artificial Intelligence · AI for Science · Reinforcement Learning
Fanye Kong
Tsinghua University, Beijing, China
Jie Zhou
Tsinghua University
Graph Neural Networks · Natural Language Processing
Jiwen Lu
Tsinghua University, Beijing, China