🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently produce object hallucinations—describing objects that are not present in the image—when generating captions and descriptions. To address this, the authors propose MARINE, a training-free, API-free framework that mitigates object hallucinations during the generation process itself. MARINE enriches the LVLM's visual context with object-grounding features extracted by existing open-source vision models, and applies classifier-free guidance to steer generation toward this grounded visual evidence, suppressing spurious object mentions without modifying model parameters or relying on proprietary LLM services. Evaluated across 6 popular LVLMs with diverse metrics, MARINE significantly reduces hallucination rates—outperforming existing fine-tuning-based methods—while also improving the detailedness of the models' generations, as assessed by GPT-4V.
📝 Abstract
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existent objects in images. To address this issue, prior work has focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs to correct the model's output post-generation. In this paper, we tackle this challenge by introducing a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object grounding features and improve the precision of LVLMs' generations. Through comprehensive evaluations across 6 popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the detailedness of LVLMs' generations, as assessed by GPT-4V.
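The classifier-free guidance idea in the abstract can be sketched at the logit level. The snippet below is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the `(1 + gamma) / -gamma` weighting follows the standard classifier-free guidance formulation, combining the logits produced with and without the extra grounding features.

```python
import numpy as np

def cfg_logits(guided, unguided, gamma=1.0):
    """Combine next-token logits with classifier-free guidance.

    guided:   logits when the LVLM is conditioned on the additional
              object-grounding features (illustrative assumption).
    unguided: logits from the same model without those features.
    gamma:    guidance strength; gamma = 0 returns the guided logits
              unchanged, while larger gamma extrapolates further
              toward tokens supported by the grounding signal.
    """
    guided = np.asarray(guided, dtype=float)
    unguided = np.asarray(unguided, dtype=float)
    return (1.0 + gamma) * guided - gamma * unguided

# Toy example: the grounding features favor token 0, the unguided
# model leans toward token 1; guidance amplifies the difference.
combined = cfg_logits([2.0, 0.0], [0.0, 1.0], gamma=1.0)
```

In an actual decoding loop, `combined` would be passed through a softmax to sample the next token; the same combination is applied at every generation step.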