🤖 AI Summary
Large vision-language models (LVLMs) frequently generate hallucinations due to visual–textual semantic misalignment. Existing training-free decoding methods suffer from static constraints, high computational overhead, and degradation of fine-grained details. To address this, we propose Dynamic Logits Calibration (DLC), a novel inference-time framework that introduces a stepwise CLIP-based semantic alignment mechanism. DLC defines a Relative Visual Advantage (RVA) metric and adaptively reweights logits against a dynamically updated context baseline, enabling real-time alignment between visual evidence and textual generation. Crucially, DLC requires no model fine-tuning or additional forward passes, ensuring compatibility with diverse LVLM architectures, including LLaVA, InstructBLIP, and MiniGPT-4, while preserving high inference efficiency. Experimental results demonstrate that DLC significantly reduces hallucination rates compared to state-of-the-art training-free decoding strategies, without compromising generation quality or speed.
📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination: the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At each decoding step, DLC employs CLIP to assess the semantic alignment between the input image and the generated text sequence. The Relative Visual Advantage (RVA) of each candidate token is then evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, balances the visual guidance against the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution that mitigates hallucinations and enhances the reliability of LVLMs in practical applications. Code will be released on GitHub.
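The decoding mechanism described above can be sketched in a simplified form. The snippet below is a minimal illustration, not the authors' implementation: the function names (`clip_score`, `dlc_step`), the top-k restriction, and the specific adaptive weighting rule `alpha = alpha_max * (1 - baseline)` are assumptions used to make the RVA idea concrete, with cosine similarity standing in for the actual CLIP image-text score.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    # Cosine similarity as a stand-in for the real CLIP image-text score.
    return float(image_emb @ text_emb /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))

def dlc_step(logits, image_emb, context_emb, candidate_embs,
             top_k=5, alpha_max=2.0):
    """One hypothetical DLC decoding step.

    logits         : model logits over the vocabulary at this step
    image_emb      : CLIP embedding of the input image
    context_emb    : CLIP embedding of the text generated so far
    candidate_embs : CLIP embeddings of the context extended by each candidate token
    """
    # Dynamically updated context baseline: how well the current text aligns.
    baseline = clip_score(image_emb, context_emb)
    # Adaptive weight: apply weaker visual guidance when the context
    # is already well aligned (assumed weighting rule, for illustration).
    alpha = alpha_max * (1.0 - max(baseline, 0.0))
    adjusted = logits.copy()
    # Calibrate only the top-k candidates to keep the step cheap.
    for i in np.argsort(logits)[-top_k:]:
        # Relative Visual Advantage of candidate i over the baseline.
        rva = clip_score(image_emb, candidate_embs[i]) - baseline
        adjusted[i] = logits[i] + alpha * rva
    return adjusted
```

In this toy form, a candidate whose extended text aligns better with the image than the current context receives a positive RVA and a boosted logit, while a visually contradictory candidate is penalized, all without any extra LVLM forward pass.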