🤖 AI Summary
In medical visual question answering (VQA), multimodal large language models (MLLMs) frequently generate hallucinations inconsistent with the input image, undermining clinical reliability. Existing hallucination detection methods based on semantic entropy (SE) face a fundamental trade-off in medical settings: weak visual perturbations fail to sufficiently activate the model's visual dependency, whereas strong perturbations distort diagnostically critical features. To address this, we propose VASE, the first hallucination detection metric jointly optimized for clinical validity and visual sensitivity. VASE preserves diagnostic content via mild, semantics-preserving image transformations; introduces a cross-image conditional distribution contrast mechanism to amplify the influence of visual input on semantic predictions; and refines semantic entropy estimation accordingly. Evaluated on two open medical VQA benchmarks, VASE achieves up to 12.7% higher AUC than state-of-the-art methods, without model fine-tuning or additional annotations.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet they remain prone to hallucinations: incorrect responses that contradict the input image, posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current detection methods, especially semantic entropy (SE), have shown promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with the one derived from a distorted image. The entropy of the resulting distribution is taken as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
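The abstract describes a three-step pipeline: estimate a semantic predictive distribution under weak transformations, contrast it against the distribution from a strongly distorted image, and score the entropy of the contrasted result. The exact formulation is not given here, so the sketch below is an illustrative guess: it amplifies visual influence with a log-space contrast `(1 + alpha) * log p_weak - alpha * log p_distorted` (the function name, `alpha` parameter, and contrast form are all assumptions, not the paper's equations).

```python
import math

def vase_entropy(p_weak, p_distorted, alpha=1.0, eps=1e-12):
    """Hypothetical sketch of a VASE-style score.

    p_weak: semantic predictive distribution over answer clusters, estimated
        from samples under weak, content-preserving image transformations.
    p_distorted: distribution obtained from a strongly distorted image.
    alpha: assumed amplification strength for the cross-image contrast.
    """
    # Contrast the two distributions in log space: semantic clusters whose
    # probability survives weak transforms but collapses under distortion
    # are up-weighted, amplifying the contribution of the visual input.
    logits = [(1 + alpha) * math.log(pw + eps) - alpha * math.log(pd + eps)
              for pw, pd in zip(p_weak, p_distorted)]
    # Softmax-normalize the contrasted logits back into a distribution.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    p = [e / z for e in exps]
    # Entropy of the contrasted distribution is the hallucination score:
    # high entropy suggests the answer is weakly grounded in the image.
    return -sum(pi * math.log(pi + eps) for pi in p)

# An answer that stays concentrated under weak transforms yields low
# entropy; a flat (uncertain) distribution yields high entropy.
low = vase_entropy([0.9, 0.05, 0.05], [0.34, 0.33, 0.33])
high = vase_entropy([0.34, 0.33, 0.33], [0.34, 0.33, 0.33])
```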