🤖 AI Summary
Large Vision-Language Models (LVLMs) frequently produce object hallucinations—describing objects that are not present in the image—when generating captions and descriptions. To address this, the authors propose MARINE, a training-free, API-free framework that mitigates object hallucinations during the generation process itself. MARINE enriches the LVLM's visual context with object-grounding features extracted by existing open-source vision models, and applies classifier-free guidance to steer generation toward this grounded visual evidence, suppressing spurious object mentions without modifying model parameters or relying on proprietary LLM services. Evaluated across 6 popular LVLMs with diverse metrics, MARINE significantly reduces hallucination rates—outperforming existing fine-tuning-based methods—while also improving the detailedness of the models' generations, as assessed by GPT-4V.
📝 Abstract
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existent objects in images. To address this issue, prior work has focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs to correct the model's output post-generation. In this paper, we tackle this challenge by introducing a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object grounding features and improve the precision of LVLMs' generations. Through comprehensive evaluations across 6 popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the detailedness of LVLMs' generations, as assessed by GPT-4V.
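The classifier-free guidance idea in the abstract can be sketched at the logit level. The snippet below is a minimal illustration, not the paper's implementation: the function name is hypothetical, and the `(1 + gamma) / -gamma` weighting follows the standard classifier-free guidance formulation, combining the logits produced with and without the extra grounding features.

```python
import numpy as np

def cfg_logits(guided, unguided, gamma=1.0):
    """Combine next-token logits with classifier-free guidance.

    guided:   logits when the LVLM is conditioned on the additional
              object-grounding features (illustrative assumption).
    unguided: logits from the same model without those features.
    gamma:    guidance strength; gamma = 0 returns the guided logits
              unchanged, while larger gamma extrapolates further
              toward tokens supported by the grounding signal.
    """
    guided = np.asarray(guided, dtype=float)
    unguided = np.asarray(unguided, dtype=float)
    return (1.0 + gamma) * guided - gamma * unguided

# Toy example: the grounding features favor token 0, the unguided
# model leans toward token 1; guidance amplifies the difference.
combined = cfg_logits([2.0, 0.0], [0.0, 1.0], gamma=1.0)
```

In an actual decoding loop, `combined` would be passed through a softmax to sample the next token; the same combination is applied at every generation step.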