When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) suffer from strong linguistic priors, leading to hallucinated responses inconsistent with image content. To address this, we propose a **training-free, plug-and-play cross-modal guided decoding method**. Our approach introduces an adaptive masking mechanism that selectively attenuates the attention weights of the most influential image tokens in critical Transformer layers, inducing an explicit degradation of vision-language perception to expose language bias. Decoding is then guided by the disparity between the original and degraded output distributions, thereby enhancing reliance on visual context. Crucially, the method requires no architectural modifications or parameter updates, preserving the original inference efficiency. Evaluated across multiple hallucination benchmarks, it significantly reduces hallucination rates while remaining compatible with diverse mainstream VLMs. The approach demonstrates strong generalizability and practical utility without compromising model capability or speed.

📝 Abstract
Vision-Language Models (VLMs) have shown strong capabilities for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe hallucination problems: they tend to generate responses that are linguistically fluent but inconsistent with the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and a variant with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected Transformer layers to corrupt visual-language perception as a concrete form of degradation. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the capabilities of VLMs. We conduct comprehensive experiments, and all results demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also quantitatively show that CMG improves different VLMs' performance on hallucination-specific benchmarks and generalizes effectively.
Problem

Research questions and friction points this paper is trying to address.

Mitigating language bias-induced hallucinations in Vision-Language Models
Addressing irrelevant text generation despite visual context
Reducing visual-language attention bias without training requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free decoding reduces language bias
Attention masking degrades visual-language perception
Cross-modal guidance emphasizes visual context perception
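The two steps listed above, attention degradation and degradation-guided decoding, can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names, the top-k masking ratio, and the exact contrastive guidance formula are assumptions inferred from the abstract.

```python
# Hypothetical sketch of CMG-style decoding. Assumptions: masking the
# top-k most-attended image tokens per query, and a contrastive-style
# guidance rule over the two output distributions.
import numpy as np

def mask_top_image_attention(attn, image_idx, ratio=0.2):
    """Zero out the largest attention weights on image-token columns
    (the 'most influential' image tokens) and renormalize each row."""
    attn = attn.copy()
    img = attn[:, image_idx]                  # rows: queries, cols: image tokens
    k = max(1, int(ratio * img.shape[1]))
    top = np.argsort(img, axis=1)[:, -k:]     # top-k image tokens per query
    rows = np.arange(img.shape[0])[:, None]
    img[rows, top] = 0.0                      # corrupt visual-language perception
    attn[:, image_idx] = img
    attn /= attn.sum(axis=1, keepdims=True)   # renormalize to a distribution
    return attn

def guided_logits(orig_logits, degraded_logits, alpha=1.0):
    """Contrastive-style guidance: amplify what the intact model predicts
    beyond the visually degraded model, steering decoding toward tokens
    grounded in the image rather than in linguistic priors."""
    return (1 + alpha) * orig_logits - alpha * degraded_logits
```

In a real VLM, `mask_top_image_attention` would be hooked into selected Transformer layers during a second forward pass, and `guided_logits` would combine the two passes' next-token logits at each decoding step.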
Jinjin Cao
MAPLE Lab, Westlake University
Zhiyang Chen
MAPLE Lab, Westlake University
Zijun Wang
MAPLE Lab, Westlake University
Liyuan Ma
Zhejiang University
image synthesis, generative models, GANs, diffusion models
Weijian Luo
Peking University
human-preferred generative models, large vision-language models
Guojun Qi
MAPLE Lab, Westlake University