Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

📅 2026-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses safety concerns in medical vision-language models stemming from poor confidence calibration and sensitivity to question rephrasing. The authors systematically evaluate multiple uncertainty quantification methods and, for the first time, demonstrate that calibration error and rephrasing sensitivity share a common cause: proximity to the decision boundary. They show that predictive entropy from a single forward pass can efficiently detect both failure modes without requiring complex ensembles. Extensive cross-distribution and cross-architecture experiments on MIMIC-CXR and PadChest reveal that total entropy outperforms ensemble-based approaches in error detection (AUROC 0.743) and rephrasing screening, while Monte Carlo Dropout achieves the best calibration (ECE 4.3%) and selective prediction performance (21.5% coverage at 5% risk).
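As a rough sketch of the single-forward-pass signal described above (not the authors' code; the PyTorch framing, tensor shapes, and the flagging threshold are illustrative assumptions), total predictive entropy can be computed directly from the per-token logits of the generated answer:

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the predictive distribution from a single
    forward pass. `logits` has shape (seq_len, vocab_size); total
    entropy sums the per-token entropies over the generated answer."""
    log_probs = F.log_softmax(logits, dim=-1)                     # (seq_len, vocab)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)    # (seq_len,)
    return token_entropy.sum()   # use .mean() for a length-normalized variant

def flag_unreliable(logits: torch.Tensor, threshold: float = 1.5) -> bool:
    """Flag a prediction whose total entropy exceeds a threshold tuned on
    validation data. The value 1.5 is illustrative, not from the paper."""
    return bool(predictive_entropy(logits) > threshold)
```

Because it needs only one decoding pass, the same entropy score can serve as a shared threshold for flagging both likely errors and rephrase-sensitive predictions.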

📝 Abstract
Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-it across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-Rad 7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-Rad, p < 10⁻⁴), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC-CXR→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-Rad's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs. 0.657) and paraphrase screening. Simple methods win.
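For reference, the two evaluation metrics quoted in the abstract have standard definitions. The sketch below uses textbook binned ECE and a coverage-at-risk curve; the function names and NumPy framing are our own assumptions, not the paper's implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over equal-width
    confidence bins, returned as a percentage."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidences[in_bin].mean())
    return 100.0 * ece

def coverage_at_risk(confidences, correct, max_risk=0.05):
    """Selective prediction: answer the most confident samples first and
    return the largest coverage whose error rate stays <= max_risk."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-confidences)                    # most confident first
    risk = np.cumsum(1.0 - correct[order]) / np.arange(1, len(correct) + 1)
    feasible = np.where(risk <= max_risk)[0]
    return 0.0 if feasible.size == 0 else (feasible[-1] + 1) / len(correct)
```

Under these definitions, "21.5% coverage at 5% risk" means roughly one in five test questions can be answered automatically while keeping the error rate among answered questions at or below 5%.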
Problem

Research questions and friction points this paper is trying to address.

calibration
paraphrase sensitivity
medical vision-language models
predictive entropy
uncertainty quantification
Innovation

Methods, ideas, or system contributions that make the work stand out.

predictive entropy
calibration
paraphrase sensitivity
uncertainty quantification
medical vision-language models