Context-Aware Decoding for Faithful Vision-Language Generation

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the prevalent issue of hallucination in large vision-language models (LVLMs), which often generate content inconsistent with the input image during open-ended generation. The study uncovers a novel phenomenon termed the "commitment-depth gap": tokens corresponding to factual content converge earlier in the decoding process than hallucinated ones. Leveraging this insight, the authors propose Context Embedding Injection (CEI), a training-free method that dynamically injects the contextual embedding from the end of the input sequence as a visual anchor during decoding to suppress hallucinations. Extensive experiments demonstrate that CEI significantly improves generation faithfulness on the CHAIR, AMBER, and MMHal-Bench benchmarks. It consistently outperforms existing approaches across three prominent LVLMs, with its dynamic variant achieving the lowest overall hallucination rate.
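The injection step described in the summary can be illustrated with a small sketch. This is a hypothetical rendering of the CEI idea, not the paper's exact formulation: the hidden state of the last input token (the context embedding) is blended into each decoding step's hidden state as a grounding signal. The function name, the linear blending rule, and the `alpha` weight are all illustrative assumptions.

```python
# Hypothetical sketch of Context Embedding Injection (CEI): blend the
# context embedding (last input token's hidden state) into a decoding
# hidden state. The linear interpolation and alpha value are assumptions.

def inject_context(hidden, context, alpha=0.2):
    """Blend the context embedding into a hidden-state vector."""
    return [(1 - alpha) * h + alpha * c for h, c in zip(hidden, context)]

# Toy example: a 4-dimensional hidden state has drifted away from the
# visual context; injection pulls it back toward the anchor.
context = [1.0, 0.0, 1.0, 0.0]   # anchor: last-input-token hidden state
hidden = [0.0, 1.0, 0.0, 1.0]    # current decoding hidden state
grounded = inject_context(hidden, context, alpha=0.5)
print(grounded)  # [0.5, 0.5, 0.5, 0.5]
```

A dynamic variant, as the summary suggests, would presumably adjust the injection strength per step rather than fixing `alpha`.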

πŸ“ Abstract
Hallucinations, i.e., responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token, the context embedding, as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
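The commitment-depth measurement described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the per-layer logits here are synthetic, whereas the Logit Lens would obtain them by applying the model's unembedding matrix to intermediate hidden states; `commitment_depth` is a hypothetical helper name.

```python
# Hypothetical sketch of measuring "commitment depth" with the Logit Lens:
# find the earliest decoder layer at which the top next-token candidate
# already matches the final layer's prediction. Layer-wise logits are
# synthetic stand-ins for unembedded intermediate hidden states.

def commitment_depth(layerwise_logits):
    """Return the first layer index whose argmax equals the final layer's."""
    final = layerwise_logits[-1]
    final_top = max(range(len(final)), key=lambda i: final[i])
    for depth, logits in enumerate(layerwise_logits):
        top = max(range(len(logits)), key=lambda i: logits[i])
        if top == final_top:
            return depth
    return len(layerwise_logits) - 1

# Toy 4-layer model over a 3-token vocabulary: the final choice (token 2)
# first becomes the top candidate at layer 2.
logits_per_layer = [
    [2.0, 1.0, 0.0],
    [1.5, 2.0, 1.0],
    [0.5, 1.0, 3.0],
    [0.2, 0.8, 4.0],
]
print(commitment_depth(logits_per_layer))  # 2
```

Under the paper's finding, truthful tokens would yield smaller depths than hallucinatory ones, i.e., they commit to their final candidate earlier in the layer stack.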
Problem

Research questions and friction points this paper is trying to address.

hallucinations
vision-language models
visual fidelity
image captioning
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Context Embedding Injection
hallucination mitigation
Logit Lens
commitment-depth gap
vision-language models
Mehrdad Fazli
Department of Computer Science, George Mason University
Bowen Wei
Department of Computer Science, George Mason University
Ziwei Zhu
Assistant Professor at George Mason University
data mining, information retrieval, machine learning, responsible AI