🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from visual hallucination, largely due to over-reliance on language priors, a problem that worsens as the generated sequence grows. This paper makes two key observations: (1) even when predicting tokens with image-related parts of speech (POS), models rely increasingly on language priors as the token sequence lengthens, amplifying hallucinations; and (2) methods that directly calibrate the LVLM's output distribution to counter language priors can degrade text quality or even exacerbate hallucinations. Building on these findings, the authors propose Summary-Guided Decoding (SumGD), which compresses the preceding text context into a summary to encourage stronger focus on image information, while restricting this intervention to image-related POS tokens to preserve text quality. Experiments show that SumGD achieves state-of-the-art performance on object hallucination benchmarks, attains Pareto optimality among existing methods in the precision-recall trade-off, and remains robust in balancing hallucination reduction with text quality.
📝 Abstract
Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs. However, they are prone to generating hallucinations due to an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) Even when predicting the tokens associated with image-related part-of-speech (POS), models increasingly rely on linguistic priors as the token sequences grow, thereby amplifying hallucinations. (2) Methods that directly calibrate an LVLM's output distribution to mitigate language priors can lead to a degradation in text quality or even exacerbate hallucinations. Based on these findings, we propose a novel method, Summary-Guided Decoding (SumGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality. Through experiments, we demonstrate that SumGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SumGD achieves Pareto optimality among the existing methods. Lastly, we observe that although existing methods struggle to balance the reduction of object hallucinations with maintaining text quality, SumGD demonstrates robustness in handling this challenge.
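Based only on the abstract's description, the core decoding loop might be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the summarizer, POS tagger, and model here are hypothetical stand-ins, and the choice of image-related POS classes is an assumption.

```python
# Illustrative sketch of Summary-Guided Decoding (SumGD), reconstructed from
# the abstract. All components below are hypothetical stand-ins.

# Assumption: POS classes treated as "image-related" (e.g., objects/attributes).
IMAGE_RELATED_POS = {"NOUN", "ADJ", "NUM"}

def summarize(context_tokens):
    """Hypothetical summarizer: compress the text generated so far.
    A real system would use a model-produced summary; here we simply
    keep the most recent tokens as a stand-in."""
    return context_tokens[-5:]

def pos_of(token):
    """Hypothetical POS tagger stub (a real system would use a tagger
    such as spaCy on the decoded text)."""
    tags = {"cat": "NOUN", "red": "ADJ", "sits": "VERB", "the": "DET"}
    return tags.get(token, "X")

def decode_step(image, context_tokens, model):
    """One SumGD step: predict the next token normally; if it carries an
    image-related POS, re-predict it conditioned on the summarized context
    (plus the image) so the model leans on visual rather than language priors."""
    token = model(image, context_tokens)
    if pos_of(token) in IMAGE_RELATED_POS:
        token = model(image, summarize(context_tokens))
    return token
```

Only image-related POS tokens take the summary-conditioned path; all other tokens are decoded from the full context, which is how the method preserves overall text quality while grounding object and attribute words in the image.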