Tell Model Where to Look: Mitigating Hallucinations in MLLMs by Vision-Guided Attention

📅 2025-11-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models (MLLMs) hallucinate during image understanding because their visual attention localizes relevant regions poorly. To address this, we propose Vision-Guided Attention (VGA), a training-free mechanism that uses the semantic content of visual tokens to establish fine-grained visual grounding and then steers attention toward the relevant image regions. In image captioning, VGA additionally suppresses regions that have already been described, so the guidance keeps adapting as generation proceeds. Each token requires only a single forward pass, so the method adds just 4.36% latency overhead, and it remains compatible with efficient attention implementations such as FlashAttention. Evaluated across multiple MLLM architectures and mainstream hallucination benchmarks, including POPE and HallusionBench, VGA achieves state-of-the-art performance, significantly reducing erroneous generations in image captioning and visual question answering. Our results demonstrate that explicit, semantics-driven visual guidance is critical for enhancing the robustness of multimodal understanding.

📝 Abstract

Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
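
As a rough illustration of what steering attention with a grounding signal could look like, the sketch below adds a log-space bias to pre-softmax attention scores over visual tokens. The abstract does not specify how VGA computes or applies its grounding, so the function name `guide_attention`, the `strength` parameter, and the additive-bias formulation are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guide_attention(logits, grounding, strength=1.0, eps=1e-6):
    """Hypothetical guidance step: bias attention logits toward grounded tokens.

    logits    -- pre-softmax attention scores, one per visual token
    grounding -- relevance scores in [0, 1], one per visual token (assumed given)
    strength  -- how strongly the grounding reshapes attention (assumed knob)
    eps       -- small constant that keeps log() finite for zero scores
    """
    if len(logits) != len(grounding):
        raise ValueError("logits and grounding must have the same length")
    # Adding log(grounding) to a logit multiplies its softmax weight by grounding.
    biased = [l + strength * math.log(g + eps) for l, g in zip(logits, grounding)]
    return softmax(biased)


# Four visual tokens with uniform raw attention; the grounding favors token 2.
logits = [0.0, 0.0, 0.0, 0.0]
grounding = [0.1, 0.1, 0.7, 0.1]
weights = guide_attention(logits, grounding)
```

With uniform logits, the resulting weights are proportional to the grounding scores, so token 2 receives the most attention. Setting `strength=0.0` recovers plain softmax attention, which makes the guidance easy to switch off in this toy setting.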

Problem

Research questions and friction points this paper is trying to address.

MLLMs' limited visual localization causes hallucination issues

How to direct the model's focus toward the relevant visual regions

How to reduce hallucinations while keeping computational overhead low

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-guided attention directs model focus to relevant visual regions

Training-free method constructs precise visual grounding from the semantic content of visual tokens

Dynamic region suppression refines guidance during caption generation
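
The dynamic suppression idea can be pictured with a toy loop like the one below. It is only a sketch under stated assumptions: the source does not say how described regions are detected or how strongly they are suppressed, so the helpers `suppress_described` and `describe_order`, the `decay` factor, and the rule that the highest-scoring token is the one "described" at each step are all hypothetical.

```python
def suppress_described(grounding, described, decay=0.1):
    """Return a new grounding list with already-described tokens scaled down.

    grounding -- relevance score per visual token
    described -- set of token indices treated as already described
    decay     -- multiplicative factor in [0, 1] (assumed; 0 removes the token)
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [g * decay if i in described else g for i, g in enumerate(grounding)]


def describe_order(grounding, steps, decay=0.1):
    """Toy generation loop: focus on the top token, then suppress it."""
    order = []
    current = list(grounding)
    for _ in range(steps):
        # Assumption: the token with the highest score is described next.
        focus = max(range(len(current)), key=current.__getitem__)
        order.append(focus)
        current = suppress_described(current, {focus}, decay)
    return order


# Three visual tokens; suppression moves the focus to a new token each step.
order = describe_order([0.2, 0.9, 0.4], steps=3)
```

In this example the focus moves from token 1 to token 2 to token 0, rather than staying on the initially dominant token. A larger `decay` would suppress more gently and let the focus return to strong regions sooner.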

🔎 Similar Papers

Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models