Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models

📅 2025-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large vision-language models (VLMs) frequently exhibit hallucinations, i.e., generated diagnostic reports that are inconsistent with gastrointestinal (GI) image content, which hinders clinical reliability. Method: We introduce Gut-VLM, a hallucination-aware multimodal benchmark for GI imaging built on the Kvasir-v2 dataset, in which medical experts review ChatGPT-generated reports and annotate both hallucinated sentences and their clinically validated corrections. We propose hallucination-aware finetuning, which trains the VLM to detect and correct hallucinations rather than only generate descriptive reports. Contribution/Results: Evaluated on multiple state-of-the-art VLMs, this approach reduces hallucination and improves clinical accuracy relative to standard report-generation finetuning, and the accompanying evaluation across hallucination and report-quality metrics establishes a benchmark for trustworthy VLM deployment in GI diagnostics.
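To make the evaluation side of the summary concrete, the sketch below shows how sentence-level scores of this kind could be computed from expert tags. The function names `hallucination_rate` and `detection_accuracy` and their definitions are assumptions made for illustration; the paper's exact metric definitions may differ.

```python
# Illustrative sketch of sentence-level metrics of the kind such a benchmark
# might report. Names and definitions are assumptions, not the paper's exact metrics.
from typing import List


def hallucination_rate(sentence_flags: List[bool]) -> float:
    """Fraction of generated sentences flagged as inconsistent with the image."""
    if not sentence_flags:
        return 0.0
    return sum(sentence_flags) / len(sentence_flags)


def detection_accuracy(predicted_flags: List[bool], expert_flags: List[bool]) -> float:
    """Agreement between model-predicted hallucination tags and expert tags."""
    if len(predicted_flags) != len(expert_flags):
        raise ValueError("flag lists must be the same length")
    if not expert_flags:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted_flags, expert_flags))
    return matches / len(expert_flags)
```

In practice, such per-report scores would be averaged over the expert-reviewed test split to compare VLMs.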

📝 Abstract
Vision-Language Models (VLMs) are becoming increasingly popular in the medical domain, bridging the gap between medical images and clinical language. Existing VLMs demonstrate an impressive ability to comprehend medical images and text queries and to generate detailed, descriptive diagnostic reports. However, hallucination, the tendency to generate descriptions that are inconsistent with the visual content, remains a significant issue in VLMs, with particularly severe implications in the medical field. To facilitate VLM research on gastrointestinal (GI) image analysis and the study of hallucination, we curate a multimodal image-text GI dataset: Gut-VLM. The dataset is created with a two-stage pipeline: first, descriptive medical reports for Kvasir-v2 images are generated using ChatGPT, which introduces some hallucinated or incorrect text; second, medical experts systematically review these reports, identifying and correcting inaccuracies to ensure high-quality, clinically reliable annotations. Unlike traditional datasets that contain only descriptive text, our dataset also features tags identifying hallucinated sentences and their corresponding corrections. A common approach to reducing hallucination in VLMs is to finetune the model on a small-scale, problem-specific dataset. We take a different strategy with our dataset: instead of finetuning the VLM solely to generate textual reports, we finetune it to detect and correct hallucinations, an approach we call hallucination-aware finetuning. Our results show that this approach outperforms finetuning for descriptive report generation alone. Additionally, we conduct an extensive evaluation of state-of-the-art VLMs across several metrics, establishing a benchmark. GitHub Repo: https://github.com/bhattarailab/Hallucination-Aware-VLM.
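To picture the two-stage pipeline and the finetuning objective, here is a minimal sketch of how an expert-annotated sample and a detect-and-correct training target could be represented. The class names, field names, and target format are hypothetical; the actual schema and prompts are defined in the linked GitHub repository.

```python
# Hypothetical sketch of a Gut-VLM-style annotated record and a
# detect-and-correct finetuning target. Schema and format are assumptions
# for illustration; see the GitHub repository for the real ones.
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentenceAnnotation:
    text: str                 # sentence from the ChatGPT-generated report
    hallucinated: bool        # expert tag: inconsistent with the image?
    correction: str = ""      # expert-provided corrected sentence, if any


@dataclass
class GutVLMSample:
    image_path: str                                    # Kvasir-v2 image
    generated_report: str                              # stage-1 ChatGPT report
    sentences: List[SentenceAnnotation] = field(default_factory=list)


def build_detect_and_correct_target(sample: GutVLMSample) -> str:
    """Build a training target that asks the VLM to flag hallucinated
    sentences and supply corrections, rather than emit a fresh report."""
    lines = []
    for i, s in enumerate(sample.sentences, start=1):
        if s.hallucinated:
            lines.append(f"Sentence {i}: hallucinated -> {s.correction}")
        else:
            lines.append(f"Sentence {i}: consistent")
    return "\n".join(lines)
```

Training on targets like this, rather than on plain descriptive reports, is what distinguishes hallucination-aware finetuning from conventional report-generation finetuning.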
Problem

Research questions and friction points this paper is trying to address.

Addressing hallucination in Vision-Language Models for medical imaging
Creating a multimodal GI dataset with corrected annotations
Proposing hallucination-aware finetuning to improve diagnostic accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage pipeline creates hallucination-labeled GI dataset
Hallucination-aware finetuning improves VLM diagnostic accuracy
Multimodal benchmark evaluates VLMs across multiple performance metrics
Bidur Khanal
Rochester Institute of Technology, USA
Machine Learning, Deep Learning, Medical Image Analysis, Computer Vision
Sandesh Pokhrel
Nepal Applied Mathematics and Informatics Institute for Research (NAAMII), Nepal
Sanjay Bhandari
Nepal Applied Mathematics and Informatics Institute for Research (NAAMII), Nepal
Ramesh Rana
Kathmandu University, Nepal
Nikesh Shrestha
Kathmandu University, Nepal
Ram Bahadur Gurung
Kathmandu University, Nepal
Cristian Linte
Rochester Institute of Technology, USA
Angus Watson
University of Aberdeen, UK
Yash Raj Shrestha
University of Lausanne, Applied AI Lab
Applied AI, Human-AI Collaboration, Data-Driven Decision-Making, Organization Design
Binod Bhattarai
Assistant Professor, University of Aberdeen
Machine Learning, Medical Image Analysis, Computer Vision