Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

📅 2024-08-04
🏛️ arXiv.org
📈 Citations: 8
Influential: 1
🤖 AI Summary
To address pervasive visual hallucinations in large vision-language models (LVLMs), this paper proposes a training-free, introspective inference-time correction method that operates entirely within the model's internal representations, requiring no fine-tuning and no external resources. Its core contributions are threefold: (1) vision-token importance scoring derived from the model's own self-attention weights; (2) a Context and Text-aware Token Selection (CT2S) strategy that identifies hallucination sources and adaptively prunes vision tokens; and (3) subtraction of the amplified vision-and-text association hallucinations in logit space. Evaluated across multiple visual hallucination benchmarks, the method substantially reduces hallucination rates while improving generation fidelity and quality. It incurs zero training overhead and only negligible computational overhead (reported as under a 2% latency increase), making it practical for real-world LVLM deployment.
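The attention-based scoring and pruning described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's implementation: the scoring rule (attention received by each vision token, averaged over heads), the `keep_ratio` parameter, and both function names are assumptions for exposition.

```python
# Hedged sketch of attention-based vision-token importance scoring and a
# CT2S-style selection step. All names and the exact scoring rule are
# illustrative assumptions, not the paper's code.
import numpy as np

def visual_token_importance(attn, vision_idx):
    """Score each vision token by the attention it receives from all
    query positions, averaged over heads.

    attn: self-attention weights of shape (heads, seq_len, seq_len)
          from one early decoder layer.
    vision_idx: list of sequence positions holding vision tokens.
    """
    mean_attn = attn.mean(axis=0)               # (seq_len, seq_len)
    # Sum the attention paid *to* each vision token.
    scores = mean_attn[:, vision_idx].sum(axis=0)
    return scores                                # (len(vision_idx),)

def keep_least_important(vision_idx, scores, keep_ratio=0.25):
    """CT2S-style selection: retain only the *least* important vision
    tokens, so the contrast branch amplifies text-informed hallucination."""
    k = max(1, int(len(vision_idx) * keep_ratio))
    order = np.argsort(scores)                   # ascending importance
    return [vision_idx[i] for i in order[:k]]
```

In the contrast branch, decoding would then continue with only the tokens returned by `keep_least_important`, while the original branch keeps all vision tokens.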

📝 Abstract
While Large Vision-Language Models (LVLMs) have rapidly advanced in recent years, the prevalent issue known as the 'hallucination' problem has emerged as a significant bottleneck, hindering their real-world deployment. Existing methods mitigate this issue mainly from two perspectives: one approach leverages extra knowledge, such as robust instruction tuning of LVLMs with curated datasets or employing auxiliary analysis networks, which inevitably incurs additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the raw vision or instruction inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT2S) strategy, which preserves only unimportant vision tokens after the early layers of LVLMs to adaptively amplify text-informed hallucination during auto-regressive decoding. This approach ensures that the multimodal knowledge absorbed in the early layers induces multimodal contextual hallucinations rather than aimless ones. Subsequently, the amplified vision-and-text association hallucinations are subtracted from the original token logits, guiding LVLMs to decode faithfully. Extensive experiments illustrate that SID generates less hallucinatory and higher-quality text across various metrics, without extra knowledge and without much additional computational burden.
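The final subtraction step in the abstract (removing the amplified hallucination from the original logits) can be sketched with the common contrastive-decoding form. The weight `alpha` and the exact combination formula are assumptions for illustration; the paper's precise formulation may differ.

```python
# Hedged sketch of the logit-space hallucination subtraction. The
# (1 + alpha) * original - alpha * hallucinated form is the standard
# contrastive-decoding combination, assumed here for illustration.
import numpy as np

def sid_contrast(logits_orig, logits_hallu, alpha=1.0):
    """Subtract the amplified vision-and-text association hallucination
    logits from the original logits."""
    return (1.0 + alpha) * logits_orig - alpha * logits_hallu

def next_token(logits_orig, logits_hallu, alpha=1.0):
    """Greedy decoding on the contrasted logits."""
    return int(np.argmax(sid_contrast(logits_orig, logits_hallu, alpha)))
```

For example, if the hallucination branch strongly favors the same token that narrowly leads in the original logits, the subtraction demotes it, steering decoding toward a token grounded in the visual input.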
Problem

Research questions and friction points this paper is trying to address.

Addresses the hallucination problem in Large Vision-Language Models (LVLMs).
Proposes Self-Introspective Decoding (SID) to reduce hallucinations at inference time.
Improves text quality without significant extra computational cost.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Introspective Decoding adaptively amplifies and then removes hallucinations, without training or external knowledge.
The CT2S strategy keeps only unimportant vision tokens after the early layers, amplifying text-informed hallucination in a contrast branch.
The amplified hallucination logits are subtracted from the original token logits to guide faithful decoding.