🤖 AI Summary
Multimodal large language models (MLLMs) suffer from visual hallucinations during image understanding because their visual attention localizes relevant regions poorly. To address this, we propose Vision-Guided Attention (VGA), a training-free mechanism that leverages the semantic content of visual tokens to establish fine-grained visual grounding and dynamically steer attention toward relevant image regions. For image captioning, VGA further suppresses already-described regions during generation, so the guidance keeps updating as the caption grows. Each token requires only a single forward pass, which adds a latency overhead of 4.36%, and the method is compatible with efficient attention implementations such as FlashAttention. Evaluated across multiple MLLM architectures and mainstream hallucination benchmarks, including POPE and HallusionBench, VGA achieves state-of-the-art performance, significantly reducing erroneous generations in image captioning and visual question answering. Our results demonstrate that explicit, semantics-driven visual guidance is critical for enhancing the robustness of multimodal understanding.
📝 Abstract
Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model's focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
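The two steps described above (build a grounding signal from the semantics of the visual tokens, then use it to steer attention while down-weighting regions already described) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the cosine-similarity grounding, the log-bias on attention logits, and every name and parameter here (`grounding_scores`, `guided_attention`, `temperature`, `alpha`, `described`) are assumptions made purely for illustration. The sketch also edits explicit attention logits for readability, so it does not show how VGA stays compatible with FlashAttention.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def grounding_scores(visual_sem, target_sem, temperature=0.1):
    """Hypothetical grounding: cosine similarity between each visual token's
    semantic vector and a target semantic vector, turned into a distribution."""
    v = visual_sem / np.linalg.norm(visual_sem, axis=1, keepdims=True)
    t = target_sem / np.linalg.norm(target_sem)
    return softmax((v @ t) / temperature)


def guided_attention(logits, grounding, described, alpha=1.0, eps=1e-9):
    """Steer attention over visual tokens with a grounding prior.

    logits    : (n_visual,) raw attention logits from the current query
    grounding : (n_visual,) distribution from grounding_scores
    described : (n_visual,) values in [0, 1]; 1 means the region is already
                described and should be suppressed (assumed bookkeeping)
    alpha     : guidance strength (assumed hyperparameter)
    """
    guide = grounding * (1.0 - described)
    return softmax(logits + alpha * np.log(guide + eps))


# Toy example: four visual tokens with one-hot "semantic" vectors.
visual_sem = np.eye(4)
target_sem = np.array([0.0, 0.0, 1.0, 0.0])  # the query concerns token 2
logits = np.zeros(4)                          # unguided attention is uniform

grounding = grounding_scores(visual_sem, target_sem)
attn_before = guided_attention(logits, grounding, described=np.zeros(4))

# Once token 2's region has been described, mark it so it is suppressed.
described = np.array([0.0, 0.0, 1.0, 0.0])
attn_after = guided_attention(logits, grounding, described)
```

In this toy setup, `attn_before` concentrates on token 2, while `attn_after` moves the attention mass to the remaining tokens, which mirrors the captioning behavior described in the abstract in spirit only.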