🤖 AI Summary
In medical visual question answering (VQA), multimodal large language models (MLLMs) frequently generate hallucinations inconsistent with the input image, undermining clinical reliability. Existing hallucination detection methods based on semantic entropy (SE) face a fundamental trade-off in medical settings: weak visual perturbations fail to sufficiently activate the model's visual dependency, whereas strong perturbations distort diagnostically critical features. To address this, we propose VASE, the first hallucination detection metric jointly optimized for clinical validity and visual sensitivity. VASE preserves diagnostic content via mild, semantics-preserving image transformations; introduces a cross-image conditional distribution contrast mechanism to amplify the influence of visual input on semantic predictions; and refines semantic entropy estimation accordingly. Evaluated on two open medical VQA benchmarks, VASE achieves up to 12.7% higher AUC than state-of-the-art methods, without model fine-tuning or additional annotations.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet they remain prone to hallucinations: incorrect responses that contradict the input image, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current detection methods, especially semantic entropy (SE), have shown promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with the one derived from a distorted image. The entropy of the resulting distribution is taken as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
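The abstract describes a three-step pipeline: estimate a semantic predictive distribution under weak transformations, contrast it against the distribution from a strongly distorted image, and score the entropy of the contrasted result. The exact formulation is not given here, so the sketch below is an illustrative guess: it amplifies visual influence with a log-space contrast `(1 + alpha) * log p_weak - alpha * log p_distorted` (the function name, `alpha` parameter, and contrast form are all assumptions, not the paper's equations).

```python
import math

def vase_entropy(p_weak, p_distorted, alpha=1.0, eps=1e-12):
    """Hypothetical sketch of a VASE-style score.

    p_weak: semantic predictive distribution over answer clusters, estimated
        from samples under weak, content-preserving image transformations.
    p_distorted: distribution obtained from a strongly distorted image.
    alpha: assumed amplification strength for the cross-image contrast.
    """
    # Contrast the two distributions in log space: semantic clusters whose
    # probability survives weak transforms but collapses under distortion
    # are up-weighted, amplifying the contribution of the visual input.
    logits = [(1 + alpha) * math.log(pw + eps) - alpha * math.log(pd + eps)
              for pw, pd in zip(p_weak, p_distorted)]
    # Softmax-normalize the contrasted logits back into a distribution.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    p = [e / z for e in exps]
    # Entropy of the contrasted distribution is the hallucination score:
    # high entropy suggests the answer is weakly grounded in the image.
    return -sum(pi * math.log(pi + eps) for pi in p)

# An answer that stays concentrated under weak transforms yields low
# entropy; a flat (uncertain) distribution yields high entropy.
low = vase_entropy([0.9, 0.05, 0.05], [0.34, 0.33, 0.33])
high = vase_entropy([0.34, 0.33, 0.33], [0.34, 0.33, 0.33])
```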