🤖 AI Summary
Large language models (LLMs) deployed in high-stakes medical settings require well-calibrated uncertainty estimates to ensure safe, trustworthy decision support—yet their calibration in clinical domains remains poorly characterized.
Method: We systematically evaluate the calibration of eight mainstream LLM families (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) on 300 board-style gastroenterology examination questions, using Brier score, AUROC, and domain-specific question-difficulty metrics. This is the first study to comparatively assess commercial, open-weight, and quantized models on medical uncertainty expression.
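For readers unfamiliar with verbalized confidence elicitation, here is a minimal sketch of how self-reported certainty is typically collected and parsed. The prompt template and the `parse_reply` helper below are illustrative assumptions; the paper's exact prompt is not reproduced here.

```python
# Illustrative sketch only: this prompt template and parser are assumptions
# for demonstration, not the study's actual elicitation protocol.
import re

PROMPT_TEMPLATE = (
    "Answer the following board-style question, then report how confident you "
    "are that your answer is correct as a percentage from 0 to 100.\n\n"
    "{question}\n\n"
    "Respond in the format:\nAnswer: <option letter>\nConfidence: <percent>"
)

def parse_reply(reply: str) -> tuple[str, float]:
    """Extract the chosen option and a confidence in [0, 1] from the reply."""
    answer = re.search(r"Answer:\s*([A-E])", reply).group(1)
    percent = float(re.search(r"Confidence:\s*(\d+(?:\.\d+)?)", reply).group(1))
    return answer, min(max(percent / 100.0, 0.0), 1.0)
```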
Contribution/Results: All models exhibit significant overconfidence: even the best-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieve Brier scores of only 0.15–0.20 and an AUROC of roughly 0.60, far from ideal calibration (a Brier score of 0 and an AUROC of 1.0). These findings reveal a critical gap in uncertainty quantification for clinical LLMs and a major safety bottleneck for real-world deployment. The study establishes the first cross-architectural, standardized empirical benchmark for evaluating the trustworthiness of medical AI systems.
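For concreteness, here is a minimal sketch of the two calibration metrics named above, computed with scikit-learn on synthetic data that mimics an overconfident model. The data, seed, and confidence distribution are assumptions for illustration, not the study's data.

```python
# Minimal sketch of the calibration metrics used above (scikit-learn).
# The data below is synthetic and only mimics an overconfident model.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=300)  # 1 = answered correctly
# Self-reported confidences stay high regardless of correctness (overconfident).
confidences = np.clip(0.85 + 0.03 * correct + rng.normal(0, 0.08, 300), 0, 1)

brier = brier_score_loss(correct, confidences)  # 0 = perfect calibration
auroc = roc_auc_score(correct, confidences)     # 0.5 = chance discrimination
print(f"Brier score: {brier:.2f}, AUROC: {auroc:.2f}")
```

A well-calibrated model would report low confidence on questions it gets wrong, pulling the Brier score toward 0 and the AUROC toward 1.0; the overconfident pattern above yields a high Brier score and an AUROC near chance.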
📝 Abstract
This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15–0.2 and an AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.

Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification