🤖 AI Summary
This work identifies the root cause of frequent hallucinations in large vision-language models (LVLMs) under cognitively demanding prompts: a decoupling between visual recognition and linguistic reasoning. Models can recognize visual elements but fail to interpret them in the context of the prompt and to ground them in internal knowledge for deliberate inference. To address this, the authors propose Visual Description Grounded Decoding (VDGD), a training-free intervention that first generates a detailed image description and prepends it to the instruction, then, during autoregressive generation, favors candidate tokens with lower KL divergence to the description. Evaluated across diverse visual reasoning benchmarks, including the newly introduced VaLLu cognitive benchmark spanning causal, analogical, and multi-step reasoning, VDGD consistently improves performance by 2-33% and substantially reduces hallucinations, all without architectural modification or fine-tuning.
📝 Abstract
Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucination. While hallucinations are well studied, their exact causes remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts (those that require simple descriptions of visual elements) but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and to effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines by 2%-33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.
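The decoding step described above can be sketched in pure Python. This is a minimal illustration of the general idea, not the paper's exact scoring rule: we assume each top-k candidate token is treated as a (smoothed) one-hot distribution, compared via KL divergence against a smoothed empirical distribution over the description's tokens, and reranked to favor lower divergence. The function name `vdgd_rerank` and the weighting parameter `alpha` are our own for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q), with a small epsilon to avoid log(0)."""
    eps = 1e-12
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def vdgd_rerank(logits, description_token_ids, vocab_size, alpha=1.0, top_k=3):
    """Pick the next token by rescoring top-k candidates:
    score = log p(token) - alpha * KL(candidate || description distribution)."""
    probs = softmax(logits)
    # Smoothed empirical distribution over the description's token ids.
    counts = [1.0] * vocab_size
    for t in description_token_ids:
        counts[t] += 1.0
    total = sum(counts)
    desc_dist = [c / total for c in counts]
    # Take the top-k candidates by model probability.
    candidates = sorted(range(vocab_size), key=lambda i: -probs[i])[:top_k]
    scored = []
    for tok in candidates:
        # Represent the candidate as a smoothed one-hot distribution.
        one_hot = [1e-6] * vocab_size
        one_hot[tok] = 1.0 - 1e-6 * (vocab_size - 1)
        kl = kl_divergence(one_hot, desc_dist)
        scored.append((math.log(probs[tok]) - alpha * kl, tok))
    scored.sort(reverse=True)
    return scored[0][1]

# Toy example: token 0 has the highest model probability, but token 1
# appears repeatedly in the description, so its KL penalty is smaller.
chosen = vdgd_rerank([2.0, 1.9, 0.0, -1.0], [1, 1, 1], vocab_size=4, alpha=1.0)
```

With `alpha=0.0` the rule reduces to plain greedy decoding; increasing `alpha` pulls generation toward tokens grounded in the description, which is the intuition behind favoring low-divergence candidates.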