🤖 AI Summary
Large language models (LLMs) deployed in high-stakes medical settings require well-calibrated uncertainty estimates to ensure safe, trustworthy decision support—yet their calibration in clinical domains remains poorly characterized.
Method: We systematically evaluate the calibration of eight mainstream LLM families (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) on 300 board-style gastroenterology examination questions, using Brier score, AUROC, and domain-specific question-difficulty metrics. This is the first study to comparatively assess commercial, open-weight, and quantized models on medical uncertainty expression.
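For readers unfamiliar with verbalized confidence elicitation, here is a minimal sketch of how self-reported certainty is typically collected and parsed. The prompt template and the `parse_reply` helper below are illustrative assumptions; the paper's exact prompt is not reproduced here.

```python
# Illustrative sketch only: this prompt template and parser are assumptions
# for demonstration, not the study's actual elicitation protocol.
import re

PROMPT_TEMPLATE = (
    "Answer the following board-style question, then report how confident you "
    "are that your answer is correct as a percentage from 0 to 100.\n\n"
    "{question}\n\n"
    "Respond in the format:\nAnswer: <option letter>\nConfidence: <percent>"
)

def parse_reply(reply: str) -> tuple[str, float]:
    """Extract the chosen option and a confidence in [0, 1] from the reply."""
    answer = re.search(r"Answer:\s*([A-E])", reply).group(1)
    percent = float(re.search(r"Confidence:\s*(\d+(?:\.\d+)?)", reply).group(1))
    return answer, min(max(percent / 100.0, 0.0), 1.0)
```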
Contribution/Results: All models exhibit significant overconfidence: even the best-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieve Brier scores of only 0.15–0.20 and an AUROC of roughly 0.60, far from ideal calibration (a Brier score of 0 and an AUROC of 1.0). These findings reveal a critical gap in uncertainty quantification for clinical LLMs and a major safety bottleneck for real-world deployment. The study establishes the first cross-architectural, standardized empirical benchmark for evaluating the trustworthiness of medical AI systems.
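For concreteness, here is a minimal sketch of the two calibration metrics named above, computed with scikit-learn on synthetic data that mimics an overconfident model. The data, seed, and confidence distribution are assumptions for illustration, not the study's data.

```python
# Minimal sketch of the calibration metrics used above (scikit-learn).
# The data below is synthetic and only mimics an overconfident model.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=300)  # 1 = answered correctly
# Self-reported confidences stay high regardless of correctness (overconfident).
confidences = np.clip(0.85 + 0.03 * correct + rng.normal(0, 0.08, 300), 0, 1)

brier = brier_score_loss(correct, confidences)  # 0 = perfect calibration
auroc = roc_auc_score(correct, confidences)     # 0.5 = chance discrimination
print(f"Brier score: {brier:.2f}, AUROC: {auroc:.2f}")
```

A well-calibrated model would report low confidence on questions it gets wrong, pulling the Brier score toward 0 and the AUROC toward 1.0; the overconfident pattern above yields a high Brier score and an AUROC near chance.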
📝 Abstract
This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15–0.2 and an AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare.

Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification