🤖 AI Summary
Multimodal large language models (MLLMs) suffer from hallucination in image understanding: they generate linguistically coherent text that is semantically inconsistent with the visual content (objects, attributes, relations). The authors observe that shallow visual-encoder features induce significantly more hallucinations than deep features. To address this, they propose **Layer Contrastive Decoding (LayerCD)**: a training-free, inference-time decoding strategy that contrasts the language output distributions induced by shallow and deep visual features, thereby suppressing hallucinations rooted in low-level visual biases. LayerCD is architecture-agnostic and requires no model modification or fine-tuning. Evaluated on two hallucination benchmarks, it consistently outperforms state-of-the-art methods, reducing hallucination rates by 12.7%–23.4% across object, attribute, and relational dimensions, demonstrating both its effectiveness and its generalization across diverse MLLM architectures and hallucination types.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the content of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate, as they capture only biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms the current state of the art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD.
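The core decoding step described above — contrasting the next-token distributions conditioned on deep versus shallow visual features — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the contrast weight `alpha`, and the plausibility cutoff `beta` (a standard device in contrastive-decoding methods to avoid promoting tokens that are implausible under the stronger distribution) are assumptions for the sketch; the two logit vectors stand in for a real MLLM's outputs under deep- and shallow-layer visual features.

```python
import numpy as np

def layer_contrastive_decoding(logits_deep, logits_shallow, alpha=1.0, beta=0.1):
    """Greedy next-token choice by contrasting deep- vs. shallow-feature logits.

    logits_deep:    next-token logits when the MLLM sees deep visual features
    logits_shallow: next-token logits when it sees shallow visual features
    alpha:          contrast strength (assumed hyperparameter, not from the paper)
    beta:           plausibility cutoff relative to the best deep-feature token
    """
    # Softmax over the deep-feature logits (numerically stabilized).
    p_deep = np.exp(logits_deep - logits_deep.max())
    p_deep /= p_deep.sum()

    # Plausibility constraint: keep only tokens reasonably likely
    # under the deep (trusted) features.
    mask = p_deep >= beta * p_deep.max()

    # Contrastive score: amplify the deep distribution and penalize
    # tokens that the shallow features also favor (likely hallucinations).
    score = (1 + alpha) * logits_deep - alpha * logits_shallow
    score = np.where(mask, score, -np.inf)
    return int(np.argmax(score))

# Toy example: token 1 is favored by shallow features (a low-level bias),
# so the contrast suppresses it in favor of token 0.
deep = np.array([1.0, 1.2, 0.0])
shallow = np.array([0.0, 2.0, 0.0])
print(layer_contrastive_decoding(deep, shallow))  # picks token 0, not 1
```

Note that plain greedy decoding on `deep` alone would pick token 1; the contrastive score demotes it because the shallow features push strongly toward it, which is exactly the bias LayerCD is designed to filter out.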