ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from severe hallucination during practical deployment, and existing mitigation strategies rely on multi-round querying, failing to meet real-time inference requirements. This paper proposes a training-free, single-pass decoding intervention method that dynamically reweights critical text tokens in a single intermediate layer of the Transformer decoder. Leveraging a text-to-vision entropy ratio mechanism, it enhances cross-modal alignment and suppresses erroneous generation without iterative inference. Unlike conventional contrastive decoding—requiring multiple forward passes—our approach incurs minimal computational overhead. Evaluated across multiple hallucination benchmarks (e.g., POPE, MME-Hallu), it consistently outperforms state-of-the-art methods with negligible latency overhead, reducing average hallucination rate by 12.6% and improving accuracy by 4.3%. The implementation is publicly available.

📝 Abstract
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our proposed ONLY consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost. Code is available at https://github.com/zifuwan/ONLY.
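The core mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the exact scaling rule, and the choice of how the text and visual branches produce logits are all assumptions made for the sketch. The idea it demonstrates is the stated one: compute a text-to-visual entropy ratio per token and amplify the token's contribution at a single intermediate layer when the text branch is comparatively confident.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def entropy(p, eps=1e-12):
    # Shannon entropy of a probability distribution.
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def entropy_ratio_reweight(hidden, text_logits, vision_logits, alpha=0.5):
    """Hypothetical one-layer intervention (illustrative only).

    Scales a token's hidden state at one intermediate layer by a factor
    derived from the text-to-visual entropy ratio. A ratio below 1 means
    the text branch is more confident (lower entropy) than the visual
    branch, so the token is amplified; a ratio of 1 leaves it unchanged.
    """
    h_text = entropy(softmax(text_logits))
    h_vision = entropy(softmax(vision_logits))
    ratio = h_text / (h_vision + 1e-12)  # text-to-visual entropy ratio
    scale = 1.0 + alpha * (1.0 - min(ratio, 1.0))  # assumed scaling rule
    return hidden * scale, ratio
```

Because the intervention touches only one layer of a single forward pass, it avoids the second (perturbed) query that contrastive decoding requires, which is the source of the latency advantage claimed above.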
Problem

Research questions and friction points this paper is trying to address.

Mitigates hallucinations in Large Vision-Language Models
Reduces need for multiple queries in decoding
Enables efficient real-time deployment with minimal cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

One-layer intervention for hallucination mitigation
Single query training-free decoding approach
Text-to-visual entropy ratio enhancement