🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently generate text that is inconsistent with the image content during multimodal reasoning. This work finds that attention maps from the visual encoder are more focused, and inherently more robust against hallucinations, than the LVLM's final visual attention. Leveraging this insight, the authors propose a lightweight, decoding-stage intervention: injecting visual encoder attention into intermediate layers of the language model to provide token-level adaptive visual guidance, without any fine-tuning. The method comprises three components: visual attention extraction, cross-modal attention injection, and dynamic mid-layer steering. Evaluated on multiple hallucination benchmarks, it achieves state-of-the-art performance, substantially reducing factual errors and improving image-text alignment while preserving generation fluency and inference efficiency. The core contributions are the first empirical demonstration that visual encoder attention is resilient to hallucination, and a new decoding-time visual alignment paradigm built on steering guided by that attention.
📝 Abstract
Large vision-language models (LVLMs) exhibit an impressive ability to jointly reason over visual and textual inputs. However, they often produce outputs that are linguistically fluent but factually inconsistent with the visual evidence, i.e., they hallucinate. Despite growing efforts to mitigate such hallucinations, a key question remains: what form of visual attention can effectively suppress hallucinations during decoding? In this work, we provide a simple answer: the vision encoder's own attention map. We show that LVLMs tend to hallucinate when their final visual-attention maps fail to concentrate on key image objects, whereas the vision encoder's more concentrated attention maps substantially reduce hallucinations. To further investigate the cause, we analyze vision-text conflicts during decoding and find that these conflicts peak in the language model's middle layers. Injecting the vision encoder's attention maps into these layers effectively suppresses hallucinations. Building on these insights, we introduce VEGAS, a simple yet effective inference-time method that integrates the vision encoder's attention maps into the language model's mid-layers and adaptively steers tokens that fail to concentrate on key image objects. Extensive experiments across multiple benchmarks demonstrate that VEGAS consistently achieves state-of-the-art performance in reducing hallucinations.
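To make the mechanism concrete, here is a minimal numpy sketch of the idea described above: blend the vision encoder's attention map into the LM's visual attention at mid-layers, but only for tokens whose attention over image patches is too diffuse (measured here by entropy). All specifics (the entropy threshold `tau`, blend weight `alpha`, the mid-layer range, and the function names) are illustrative assumptions, not the paper's actual VEGAS implementation.

```python
import numpy as np

def attention_entropy(attn):
    # Shannon entropy of an attention distribution over image patches;
    # diffuse (high-entropy) attention is treated as a hallucination signal.
    p = attn / attn.sum(-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(-1)

def inject_encoder_attention(lm_attn, enc_attn, layer,
                             mid_layers=range(10, 20), tau=1.0, alpha=0.5):
    """Steer diffuse tokens toward the vision encoder's attention map.

    lm_attn:  (num_tokens, num_patches) LM visual attention at this layer
    enc_attn: (num_patches,) vision encoder attention map
    Hyperparameters tau/alpha/mid_layers are hypothetical placeholders.
    """
    if layer not in mid_layers:        # only intervene at mid-layers
        return lm_attn
    out = lm_attn.copy()
    for t in range(lm_attn.shape[0]):  # token-level adaptive steering
        if attention_entropy(lm_attn[t]) > tau:
            # Blend in the encoder map and renormalise to a distribution.
            out[t] = (1 - alpha) * lm_attn[t] + alpha * enc_attn
            out[t] /= out[t].sum()
    return out
```

In this sketch a sharply focused token row passes through unchanged, while a near-uniform row is pulled toward the encoder's concentrated map; layers outside the mid-layer window are never modified, matching the finding that vision-text conflicts peak in the middle layers.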