Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs

πŸ“… 2026-02-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the hallucination problem in large vision-language models (LVLMs), which often arises from attention biases that produce outputs inconsistent with visual inputs or instructions. The study is, per the authors, the first to uncover and leverage the internal Positive Attention Dynamics (PAD) of LVLMs, constructing PAD maps to identify semantically critical visual regions. Building on this insight, the authors propose a training-free, adaptive attention intervention: it enhances attention to key regions with a strength scaled by the per-head median absolute deviation, and counteracts attention sink effects through a system-token compensation mechanism. Evaluated across multiple LVLMs and benchmarks, the approach improves visual grounding accuracy, reduces hallucination rates, and enhances the reliability of multimodal reasoning.

πŸ“ Abstract
LVLMs have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free remedies fall short: contrastive decoding and auxiliary expert models incur several times the computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that the internal Positive Attention Dynamics (PAD) of LVLMs naturally reveal semantically core visual regions even under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-range output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating internal attention dynamics as a signal for reliable multimodal reasoning.
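The paper's code is not reproduced on this page. As a rough sketch of the per-head MAD-scaled boost the abstract describes (the function name, tensor shapes, default strength, and the renormalization step are all assumptions, not the authors' implementation):

```python
import numpy as np

def mad_scale(attn, core_mask, alpha=0.5):
    """Boost attention on PAD-identified 'core' visual tokens, with a
    per-head strength set by each head's median absolute deviation (MAD).

    attn      -- (heads, num_tokens) attention weights at one query position
    core_mask -- boolean mask over tokens marking core visual regions
    alpha     -- hypothetical global intervention strength
    """
    # Per-head MAD: robust spread of each head's attention distribution.
    med = np.median(attn, axis=-1, keepdims=True)
    mad = np.median(np.abs(attn - med), axis=-1, keepdims=True)

    # Add an MAD-proportional boost only on the core-region tokens.
    boosted = attn.copy()
    boosted[:, core_mask] += alpha * mad

    # Renormalize so each head's weights still sum to 1.
    return boosted / boosted.sum(axis=-1, keepdims=True)
```

Because the boost is proportional to each head's own MAD, heads with a nearly uniform (low-variation) attention pattern are intervened on only weakly, which matches the abstract's notion of adaptively controlling intervention strength per head.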
Problem

Research questions and friction points this paper is trying to address.

hallucination
LVLMs
visual grounding
attention dynamics
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Positive Attention Dynamics
Hallucination Mitigation
Attention Sink
Training-Free Intervention
Visual Grounding
Guangtao Lyu
School of Electronic Engineering, Xidian University, Xi’an, China
Qi Liu
School of Electronic Engineering, Xidian University, Xi’an, China
Chenghao Xu
EPFL
Robotics · Dynamic SLAM · Active Vision
Jiexi Yan
School of Computer Science and Technology, Xidian University, Xi’an, China
Muli Yang
Institute for Infocomm Research (I2R), A*STAR, Singapore
Computer Vision · Machine Learning · Open-World Learning · Multimodal Modeling
Xueting Li
NVIDIA Research
Computer Vision
Fen Fang
Institute for Infocomm Research, A*STAR, Singapore
Cheng Deng
University of Edinburgh
On-device LLM · NLP · GeoAI