🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucination in image understanding: they generate linguistically coherent text that is semantically inconsistent with the visual content (objects, attributes, relations). The authors observe that shallow visual-encoder features induce significantly more hallucinations than deep features. To address this, they propose **Layer Contrastive Decoding (LayerCD)**: a training-free, inference-time decoding strategy that contrasts the language output distributions induced by shallow and deep visual features, thereby suppressing hallucinations rooted in low-level visual biases. LayerCD is architecture-agnostic and requires no model modification or fine-tuning. Evaluated on two hallucination benchmarks, it consistently outperforms state-of-the-art methods, reducing hallucination rates by 12.7%–23.4% across object, attribute, and relational dimensions, demonstrating both its effectiveness and its generalization across diverse MLLM architectures and hallucination types.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the content of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate, as they capture only biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms the current state of the art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
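The core decoding step described above — contrasting the next-token distributions conditioned on deep versus shallow visual features — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the contrast weight `alpha`, and the plausibility cutoff `beta` (a standard device in contrastive-decoding methods to avoid promoting tokens that are implausible under the stronger distribution) are assumptions for the sketch; the two logit vectors stand in for a real MLLM's outputs under deep- and shallow-layer visual features.

```python
import numpy as np

def layer_contrastive_decoding(logits_deep, logits_shallow, alpha=1.0, beta=0.1):
    """Greedy next-token choice by contrasting deep- vs. shallow-feature logits.

    logits_deep:    next-token logits when the MLLM sees deep visual features
    logits_shallow: next-token logits when it sees shallow visual features
    alpha:          contrast strength (assumed hyperparameter, not from the paper)
    beta:           plausibility cutoff relative to the best deep-feature token
    """
    # Softmax over the deep-feature logits (numerically stabilized).
    p_deep = np.exp(logits_deep - logits_deep.max())
    p_deep /= p_deep.sum()

    # Plausibility constraint: keep only tokens reasonably likely
    # under the deep (trusted) features.
    mask = p_deep >= beta * p_deep.max()

    # Contrastive score: amplify the deep distribution and penalize
    # tokens that the shallow features also favor (likely hallucinations).
    score = (1 + alpha) * logits_deep - alpha * logits_shallow
    score = np.where(mask, score, -np.inf)
    return int(np.argmax(score))

# Toy example: token 1 is favored by shallow features (a low-level bias),
# so the contrast suppresses it in favor of token 0.
deep = np.array([1.0, 1.2, 0.0])
shallow = np.array([0.0, 2.0, 0.0])
print(layer_contrastive_decoding(deep, shallow))  # picks token 0, not 1
```

Note that plain greedy decoding on `deep` alone would pick token 1; the contrastive score demotes it because the shallow features push strongly toward it, which is exactly the bias LayerCD is designed to filter out.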