When Images Speak Louder: Mitigating Language Bias-induced Hallucinations in VLMs through Cross-Modal Guidance

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) suffer from strong linguistic priors, leading to hallucinated responses inconsistent with image content. To address this, we propose a **training-free, plug-and-play cross-modal guided decoding method**. Our approach introduces an adaptive masking mechanism that selectively attenuates the attention weights of the most influential image tokens in critical Transformer layers, inducing an explicit degradation of vision-language perception to expose language bias. Decoding is then guided by the disparity between the original and degraded output distributions, thereby enhancing reliance on visual context. Crucially, the method requires no architectural modifications or parameter updates, preserving the original inference efficiency. Evaluated across multiple hallucination benchmarks, it significantly reduces hallucination rates while remaining compatible with diverse mainstream VLMs. The approach demonstrates strong generalizability and practical utility without compromising model capability or speed.

📝 Abstract
Vision-Language Models (VLMs) have shown strong capabilities for multimodal understanding of both visual and language contexts. However, existing VLMs often face severe hallucination problems: they tend to generate responses that are linguistically fluent but inconsistent with the images in the preceding context. To address this issue, we analyze how language bias contributes to hallucinations and then introduce Cross-Modal Guidance (CMG), a training-free decoding method that mitigates hallucinations by leveraging the difference between the output distributions of the original model and a variant with degraded visual-language attention. In practice, we adaptively mask the attention weights of the most influential image tokens in selected Transformer layers to corrupt visual-language perception as a concrete form of degradation. Such degradation-induced decoding emphasizes the perception of visual contexts and therefore significantly reduces language bias without harming the capabilities of VLMs. We conduct comprehensive experiments, and all results demonstrate the advantages of CMG, which requires neither additional conditions nor training costs. We also quantitatively show that CMG improves different VLMs' performance on hallucination-specific benchmarks and generalizes effectively.
Problem

Research questions and friction points this paper is trying to address.

Mitigating language bias-induced hallucinations in Vision-Language Models
Addressing irrelevant text generation despite visual context
Reducing visual-language attention bias without training requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free decoding reduces language bias
Attention masking degrades visual-language perception
Cross-modal guidance emphasizes visual context perception
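The two steps listed above, attention degradation and degradation-guided decoding, can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names, the top-k masking ratio, and the exact contrastive guidance formula are assumptions inferred from the abstract.

```python
# Hypothetical sketch of CMG-style decoding. Assumptions: masking the
# top-k most-attended image tokens per query, and a contrastive-style
# guidance rule over the two output distributions.
import numpy as np

def mask_top_image_attention(attn, image_idx, ratio=0.2):
    """Zero out the largest attention weights on image-token columns
    (the 'most influential' image tokens) and renormalize each row."""
    attn = attn.copy()
    img = attn[:, image_idx]                  # rows: queries, cols: image tokens
    k = max(1, int(ratio * img.shape[1]))
    top = np.argsort(img, axis=1)[:, -k:]     # top-k image tokens per query
    rows = np.arange(img.shape[0])[:, None]
    img[rows, top] = 0.0                      # corrupt visual-language perception
    attn[:, image_idx] = img
    attn /= attn.sum(axis=1, keepdims=True)   # renormalize to a distribution
    return attn

def guided_logits(orig_logits, degraded_logits, alpha=1.0):
    """Contrastive-style guidance: amplify what the intact model predicts
    beyond the visually degraded model, steering decoding toward tokens
    grounded in the image rather than in linguistic priors."""
    return (1 + alpha) * orig_logits - alpha * degraded_logits
```

In a real VLM, `mask_top_image_attention` would be hooked into selected Transformer layers during a second forward pass, and `guided_logits` would combine the two passes' next-token logits at each decoding step.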
Jinjin Cao
MAPLE Lab, Westlake University
Zhiyang Chen
MAPLE Lab, Westlake University
Zijun Wang
MAPLE Lab, Westlake University
Liyuan Ma
Zhejiang University
image synthesis, generative models, GANs, diffusion models
Weijian Luo
Peking University
human-preferred generative models, large vision-language models
Guojun Qi
MAPLE Lab, Westlake University