BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address hallucination in large vision-language models (VLMs) caused by vision–text misalignment, this paper proposes a joint distribution calibration framework grounded in bijectivity and maximum-likelihood learning. Methodologically, it introduces invertible normalizing flows—first applied to VLM hallucination mitigation—to establish a differentiable, interpretable, and invertible image–text alignment mechanism that imposes trustworthiness constraints during decoding, thereby transcending conventional post-hoc correction or supervised fine-tuning paradigms. On the POPE benchmark, the method achieves an average F1 score of 85.06%, with CHAIRS and CHAIRI hallucination metrics reduced by 7.6% and 2.6%, respectively, demonstrating substantial improvements in generation faithfulness and interpretability. The core contribution is the first bijective-mapping-based hallucination calibration paradigm for VLMs, unifying joint vision–language distribution modeling while guaranteeing decoding consistency.

Technology Category

Application Category

📝 Abstract
Large vision-language models have become widely adopted to advance in various domains. However, developing a trustworthy system with minimal interpretable characteristics of large-scale models presents a significant challenge. One of the most prevalent terms associated with the fallacy functions caused by these systems is hallucination, where the language model generates a response that does not correspond to the visual content. To mitigate this problem, several approaches have been developed, and one prominent direction is to ameliorate the decoding process. In this paper, we propose a new Bijective Maximum Likelihood Learning (BIMA) approach to hallucination mitigation using normalizing flow theories. The proposed BIMA method can efficiently mitigate the hallucination problem in prevailing vision-language models, resulting in significant improvements. Notably, BIMA achieves the average F1 score of 85.06% on POPE benchmark and remarkably reduce CHAIRS and CHAIRI by 7.6% and 2.6%, respectively. To the best of our knowledge, this is one of the first studies that contemplates the bijection means to reduce hallucination induced by large vision-language models.
Problem

Research questions and friction points this paper is trying to address.

Mitigate hallucination in vision-language models
Improve decoding process reliability
Enhance model interpretability and trust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bijective Maximum Likelihood Learning Approach
Normalizing flow theories for hallucination mitigation
Improves F1 score and reduces CHAIRS metrics