Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) frequently exhibit object hallucination—generating descriptions containing objects absent from the input image. This paper identifies the root cause as overconfident mapping of soft visual tokens into the language embedding space, wherein irrelevant visual features are erroneously activated. To address this, we establish, for the first time, a formal link between semantic similarity distribution smoothness and hallucination propensity. We propose Adaptive Variational Information Bottleneck (AdaVIB), a novel regularization framework that dynamically constrains information flow via entropy-driven stochastic noise injection. AdaVIB preserves task-critical visual semantics while suppressing spurious feature activation. Compatible with mainstream VLM architectures, it achieves significant reductions in hallucination rates on two established object hallucination benchmarks, markedly improving description faithfulness. Our approach introduces a principled, information-theoretic paradigm for enhancing the reliability and trustworthiness of VLM-generated outputs.

📝 Abstract
Large vision-language models show tremendous potential in understanding visual information through human language. However, they are prone to object hallucination, i.e., generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens are mapped into the LLM's word embedding space. Specifically, by computing the semantic similarity between visual tokens and the LLM's word embeddings, we observe that the smoothness of the similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, which helps constrain irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy that adaptively constrains the injected noise according to the smoothness of the similarity distribution. We apply the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that AdaVIB mitigates object hallucinations by effectively alleviating overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.
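The abstract's diagnostic step measures how "smooth" the similarity distribution between a soft visual token and the LLM's vocabulary embeddings is. A minimal sketch of that measurement, using entropy of the softmax-normalized cosine similarities as the smoothness proxy (`similarity_entropy` and its arguments are illustrative names, not the paper's implementation):

```python
import numpy as np

def similarity_entropy(visual_token, word_embeddings):
    """Entropy of the softmax-normalized similarity distribution between one
    soft visual token and a word-embedding matrix (one row per vocab entry).
    Higher entropy = smoother distribution; lower entropy = more peaked
    (confident) mapping. Illustrative sketch, not the paper's exact measure."""
    # Cosine similarity of the token against every vocabulary embedding.
    v = visual_token / np.linalg.norm(visual_token)
    W = word_embeddings / np.linalg.norm(word_embeddings, axis=1, keepdims=True)
    sims = W @ v
    # Softmax turns similarities into a probability distribution over the vocab.
    exp = np.exp(sims - sims.max())
    p = exp / exp.sum()
    # Shannon entropy of that distribution (natural log).
    return float(-np.sum(p * np.log(p + 1e-12)))
```

A token orthogonal to every embedding yields a uniform (maximally smooth) distribution with entropy log(V); a token aligned with one embedding yields a more peaked distribution and lower entropy.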
Problem

Research questions and friction points this paper is trying to address.

Mitigates object hallucination in vision-language models
Reduces overconfidence in irrelevant visual features
Adaptively controls noise to constrain information flow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive noise injection using Variational Information Bottleneck
Entropy-based strategy controls noise adaptively
Improves object hallucination benchmarks consistently
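The adaptive noise injection above can be sketched as VIB-style reparameterized sampling whose noise scale is driven by the entropy of the similarity distribution. The scaling direction and the function/argument names (`adaptive_vib_noise`, `max_entropy`) are assumptions for illustration: since the abstract links smoother similarity distributions to hallucination, this sketch injects more noise when entropy is high; the paper's actual schedule may differ.

```python
import numpy as np

def adaptive_vib_noise(z_mu, z_logvar, entropy, max_entropy, rng=None):
    """Reparameterized VIB sample z = mu + alpha * sigma * eps, where the
    noise scale alpha depends on the smoothness (entropy) of the token's
    similarity distribution. Hypothetical sketch: alpha = entropy/max_entropy,
    so a smooth (high-entropy) distribution gets more constraining noise and
    a peaked one is left closer to its mean."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = entropy / max_entropy            # assumed scaling, in [0, 1]
    eps = rng.standard_normal(z_mu.shape)    # standard Gaussian noise
    sigma = np.exp(0.5 * z_logvar)           # std. dev. from log-variance
    return z_mu + alpha * sigma * eps
```

At `entropy = 0` the sample collapses to the mean (no noise); at `entropy = max_entropy` it is a full VIB sample, so the constraint on information flow tightens smoothly with distribution smoothness.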
Jiaqi Bai
Beihang University
Natural Language Processing, Information Retrieval, Large Language Model
Hongcheng Guo
School of Data Science, Fudan University
LLMs, Multimodal LLMs
Zhongyuan Peng
Fudan University
LLM
Jian Yang
CCSE, Beihang University, China
Zhoujun Li
Beihang University
Artificial Intelligence, Natural Language Processing, Network Security
Mohan Li
Cyberspace Institute of Advanced Technology, Guangzhou University, China; Huangpu Research School of Guangzhou University, China
Zhihong Tian
Cyberspace Institute of Advanced Technology, Guangzhou University, China; Huangpu Research School of Guangzhou University, China