Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) deployed in high-stakes medical settings require well-calibrated uncertainty estimates for safe, trustworthy decision support, yet their calibration in clinical domains remains poorly characterized. Method: We systematically evaluate the calibration of eight mainstream LLM families (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) on 300 board-style gastroenterology examination questions, using Brier score, AUROC, and domain-specific question-difficulty metrics. This is the first study to comparatively assess commercial, open-weight, and quantized models on medical uncertainty expression. Contribution/Results: All models exhibit significant overconfidence: even the best-performing model (GPT-o1 preview) reaches only a Brier score of 0.15-0.20 and an AUROC of 0.60, well short of ideal calibration (a perfectly calibrated, discriminative model approaches a Brier score of 0 and an AUROC of 1). These findings reveal a critical gap in uncertainty quantification for clinical LLMs and a major safety bottleneck for real-world deployment. The study establishes the first cross-architecture, standardized empirical benchmark for evaluating the trustworthiness of medical AI systems.

📝 Abstract
This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and an AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.

Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification
Problem

Research questions and friction points this paper is trying to address.

Evaluating the self-reported confidence of LLMs on gastroenterology board-style questions
Characterizing performance and overconfidence trends across LLM families
Addressing the challenge of uncertainty estimation for LLMs in healthcare
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elicited and evaluated self-reported confidence across commercial, open-weight, and quantized LLMs
Used Brier score and AUROC as calibration metrics
Identified systematic overconfidence as a key barrier to clinical LLM deployment
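The two calibration metrics named above can be computed directly from a model's stated confidences and the correctness of its answers. The following is a minimal sketch, not the study's actual evaluation code; the `confidences` and `correct` arrays are illustrative placeholder data, not results from the paper.

```python
# Sketch of the two calibration metrics used in the study, computed from
# hypothetical self-reported confidences (0-1) and correctness labels (0/1).

def brier_score(confidences, correct):
    """Mean squared error between stated confidence and actual outcome.
    0 is perfect; lower is better."""
    return sum((c - o) ** 2 for c, o in zip(confidences, correct)) / len(correct)

def auroc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one (ties count half). 1 is perfect; 0.5 is chance."""
    pos = [c for c, o in zip(confidences, correct) if o == 1]
    neg = [c for c, o in zip(confidences, correct) if o == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data only: an overconfident model states high certainty
# even on the questions it gets wrong.
confidences = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6]
correct     = [1,   0,   1,    0,   1,    1]

print(round(brier_score(confidences, correct), 3))  # 0.221
print(round(auroc(confidences, correct), 3))        # 0.75
```

Note the asymmetry the paper's results hinge on: a model can reach a moderate AUROC (confidence ranks right answers above wrong ones) while still being poorly calibrated in absolute terms, because uniformly inflated confidences leave the ranking intact but raise the Brier score.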