🤖 AI Summary
Existing vision-language models (VLMs) suffer from strong linguistic priors, leading to hallucinated responses inconsistent with image content. To address this, we propose a **training-free, plug-and-play cross-modal guided decoding method**. Our approach introduces an adaptive masking mechanism that selectively attenuates the attention weights of the most influential image tokens in selected Transformer layers, inducing an explicit degradation of vision–language perception that exposes the model's language bias. Decoding is then guided by the disparity between the original and degraded output distributions, thereby strengthening reliance on visual context. Crucially, the method requires no architectural modifications or parameter updates, preserving original inference efficiency. Evaluated across multiple hallucination benchmarks, it significantly reduces hallucination rates while remaining compatible with diverse mainstream VLMs. The approach demonstrates strong generalizability and practical utility without compromising model capability or speed.
📝 Abstract
Vision-Language Models (VLMs) have shown strong ability in multimodal understanding of both visual and language contexts. However, existing VLMs often face severe hallucination issues: they tend to generate responses that are linguistically fluent but inconsistent with the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and a variant with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected Transformer layers, corrupting visual-language perception as a concrete form of degradation. Such degradation-guided decoding emphasizes the perception of visual context and therefore significantly reduces language bias without harming the capabilities of VLMs. Our comprehensive experiments demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also quantitatively show that CMG improves the performance of different VLMs on hallucination-specific benchmarks and generalizes effectively.
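The degrade-then-contrast idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the top-k masking rule, the `alpha` guidance weight, and the toy logits are all assumptions standing in for a real VLM's attention maps and two forward passes.

```python
import numpy as np

def mask_top_image_attention(attn, image_slice, k):
    """Zero the attention weights of the k most-attended image tokens and
    renormalize -- an assumed concrete form of the adaptive masking that
    produces the degraded (vision-corrupted) forward pass."""
    attn = attn.copy()
    img = attn[image_slice]
    img[np.argsort(img)[-k:]] = 0.0  # drop the most influential image tokens
    attn[image_slice] = img
    return attn / attn.sum()

def cmg_guided_logits(logits_orig, logits_degraded, alpha=1.0):
    """Contrast the two output distributions (in logit space): amplify what
    the original model predicts beyond the vision-degraded model, thereby
    down-weighting tokens driven purely by the language prior."""
    return (1 + alpha) * logits_orig - alpha * logits_degraded

# Toy next-token logits over a 3-token vocabulary.
logits_orig = np.array([2.0, 1.5, 0.5])      # full model: favors token 0
logits_degraded = np.array([1.0, 2.5, 0.5])  # vision masked: language prior favors token 1
guided = cmg_guided_logits(logits_orig, logits_degraded, alpha=1.0)
print(guided.argmax())  # the visually grounded token 0 wins; the biased token 1 is suppressed
```

In a real VLM the two logit vectors would come from two forward passes of the same frozen model, one with the masked attention hooked into the chosen layers, which is why the method needs no retraining.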