Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Black-box vision-language models (VLMs) frequently hallucinate in radiology visual question answering (VQA), which compromises clinical reliability. Method: We propose a non-intrusive, question-level filter based on discrete semantic entropy (DSE), which quantifies semantic inconsistency without access to model internals. To compute DSE, each question is answered multiple times at high temperature, the answers are grouped into semantic clusters using bidirectional entailment checks, and the entropy is computed from the relative frequencies of those clusters. Bootstrap resampling supplies the p-values and confidence intervals for the evaluation. Contribution/Results: Because the method needs only sampled outputs, it applies to any black-box VLM. On 706 radiology image-question pairs, excluding questions with DSE > 0.3 raised accuracy on the retained questions from 51.7% to 76.3% for GPT-4o (334 of 706 questions retained) and from 54.8% to 63.8% for GPT-4.1 (499 of 706 retained); both improvements were statistically significant (p < 0.001). These results indicate that DSE filtering can screen out hallucination-prone questions and offers a practical filtering strategy for clinical VLM applications.
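For orientation, discrete semantic entropy is usually written as the Shannon entropy of the cluster frequencies. The page does not state the logarithm base or any normalization, so the natural logarithm below is an assumption, as are the symbols K, n_i, and N.

```latex
% Assumed notation: K semantic clusters, n_i sampled answers in cluster i,
% N sampled answers in total (N = 15 in the study described in the abstract).
\mathrm{DSE} \;=\; -\sum_{i=1}^{K} \frac{n_i}{N}\,\ln\frac{n_i}{N}

% Worked values for N = 15 under the natural-log assumption:
%   all 15 answers in one cluster:  DSE = 0
%   14 / 1 split:  -(14/15)\ln(14/15) - (1/15)\ln(1/15)  \approx 0.245
%   13 / 2 split:  -(13/15)\ln(13/15) - (2/15)\ln(2/15)  \approx 0.393
```

If the natural logarithm is indeed what was used, a cutoff of 0.3 would retain only questions where at least 14 of the 15 sampled answers fall into the same cluster. With a different base or a normalized entropy, the cutoff would correspond to a different level of agreement.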

📝 Abstract
The aim was to determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: (i) the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and (ii) a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. P values and 95% confidence intervals were obtained using bootstrap resampling, with a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
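The clustering and entropy steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the greedy comparison against each cluster's first member, the natural logarithm, and the function names are all assumptions, and `entails` is a hypothetical callable standing in for whatever entailment checker is used.

```python
import math
from typing import Callable, List


def cluster_by_entailment(
    answers: List[str],
    entails: Callable[[str, str], bool],  # hypothetical: True if first argument entails second
) -> List[List[str]]:
    """Greedily group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional check: same meaning only if entailment holds both ways.
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([answer])
    return clusters


def discrete_semantic_entropy(clusters: List[List[str]]) -> float:
    """Shannon entropy (natural log, an assumption) of relative cluster sizes."""
    total = sum(len(cluster) for cluster in clusters)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy -= p * math.log(p)
    return entropy


def exact_match(a: str, b: str) -> bool:
    """Toy stand-in for an entailment model: case-insensitive string equality."""
    return a.strip().lower() == b.strip().lower()


if __name__ == "__main__":
    sampled = ["Pneumonia"] * 10 + ["pleural effusion"] * 5
    groups = cluster_by_entailment(sampled, exact_match)
    print(len(groups), discrete_semantic_entropy(groups))
```

In practice `exact_match` would be replaced by a real entailment check, for example a language model prompted to judge whether one answer implies the other. A question whose 15 sampled answers all land in one cluster gets an entropy of 0, and the value grows as the answers spread over more clusters.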
Problem

Research questions and friction points this paper is trying to address.

Detecting hallucinations in radiology vision-language models using semantic entropy
Improving diagnostic accuracy of black-box VLMs in medical visual question answering
Filtering unreliable questions to enhance clinical VLM application reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses discrete semantic entropy for hallucination detection
Filters high-entropy questions to improve VLM accuracy
Quantifies semantic inconsistency in black-box vision-language models
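The filtering step can be illustrated with a short sketch that drops questions above a DSE cutoff and recomputes accuracy on the rest, with a percentile bootstrap interval. The page does not describe the exact bootstrap procedure, so the percentile method, the resampling of whole questions, and all function and parameter names here are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def accuracy_after_filtering(
    correct: Sequence[bool], dse: Sequence[float], threshold: float
) -> Tuple[float, int]:
    """Accuracy on questions with DSE <= threshold, plus how many were retained."""
    kept = [c for c, h in zip(correct, dse) if h <= threshold]
    if not kept:
        raise ValueError("no questions retained at this threshold")
    return sum(kept) / len(kept), len(kept)


def bootstrap_ci(
    correct: Sequence[bool],
    dse: Sequence[float],
    threshold: float,
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the filtered accuracy (assumed procedure)."""
    rng = random.Random(seed)
    n = len(correct)
    stats: List[float] = []
    for _ in range(n_boot):
        # Resample questions with replacement, then apply the same filter.
        idx = [rng.randrange(n) for _ in range(n)]
        kept = [correct[i] for i in idx if dse[i] <= threshold]
        if kept:  # skip resamples in which every question was filtered out
            stats.append(sum(kept) / len(kept))
    if not stats:
        raise ValueError("no bootstrap resample retained any question")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    # Made-up toy data, not taken from the study.
    is_correct = [True, True, False, True, False, False]
    entropies = [0.0, 0.0, 0.0, 0.9, 0.9, 0.9]
    print(accuracy_after_filtering(is_correct, entropies, threshold=0.3))
    print(bootstrap_ci(is_correct, entropies, threshold=0.3))
```

The toy data are invented for illustration. The sketch keeps questions with DSE at or below the cutoff, matching the abstract's wording that questions with DSE above 0.3 or 0.6 were excluded.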
Patrick Wienholt
Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
Sophie Caselitz
Lab for Artificial Intelligence in Medicine, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
Robert Siepmann
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Philipp Bruners
Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
Keno Bressem
Technical University Munich
deep learning, radiomics, microwave ablation
Christiane Kuhl
Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
Jakob Nikolas Kather
Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.
Sven Nebelung
Department of Diagnostic and Interventional Radiology, University Hospital Aachen
Advanced MRI Techniques, Functionality Assessment, Biomechanical Imaging, Cartilage, Artificial Intelligence
Daniel Truhn
Professor of Radiology, University Hospital Aachen
Machine Learning, Artificial Intelligence, Computer Vision, Medical Imaging