Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of diagnostic errors made by large language models (LLMs) during multi-turn medical consultations, which often stem from incomplete patient information and the absence of an effective evaluation of the relationship between model confidence and correctness. To this end, we construct the first benchmark specifically designed for confidence assessment in real-world multi-turn medical dialogues and propose MedConf, a novel framework that leverages retrieval-augmented generation to build symptom profiles. MedConf aligns patient information through supporting, missing, and contradictory relations and introduces an interpretable evidence-weighting mechanism for evidence-based self-assessment. By jointly modeling diagnostic accuracy and information completeness, our approach establishes a dynamic paradigm for confidence–correctness evaluation. Experiments demonstrate that MedConf consistently outperforms existing methods across two LLMs and three medical datasets, and that it remains robust under conditions of insufficient information and comorbidity.

📝 Abstract
Large language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information-sufficiency gradient to characterize confidence–correctness dynamics as evidence accumulates. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and consistency-level confidence methods, and (2) medical reasoning must be evaluated for both diagnostic accuracy and information completeness. Based on these insights, we present MedConf, an evidence-grounded linguistic self-assessment framework that constructs symptom profiles via retrieval-augmented generation, aligns patient information with supporting, missing, and contradictory relations, and aggregates them into an interpretable confidence estimate through weighted integration. Across two LLMs and three medical datasets, MedConf consistently outperforms state-of-the-art methods on both AUROC and Pearson correlation coefficient metrics, maintaining stable performance under conditions of information insufficiency and multimorbidity. These results demonstrate that information adequacy is a key determinant of credible medical confidence modeling, providing a new pathway toward building more reliable and interpretable large medical models.
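To make the aggregation idea concrete, here is a minimal sketch of how supporting, missing, and contradictory evidence counts could be weighted into a single confidence score. This is an illustration of the general evidence-weighting concept only, not the paper's implementation; the function name, the logistic squash, and the weights `w_s`, `w_m`, `w_c` are all assumptions for the sake of the example.

```python
import math

def evidence_confidence(support: int, missing: int, contradict: int,
                        w_s: float = 1.0, w_m: float = 0.5,
                        w_c: float = 1.5) -> float:
    """Map counts of supporting, missing, and contradictory evidence
    to a confidence in (0, 1) via a weighted sum and a logistic squash.

    Supporting evidence raises the score; missing information and
    contradictions lower it, with contradictions penalized hardest.
    """
    score = w_s * support - w_m * missing - w_c * contradict
    return 1.0 / (1.0 + math.exp(-score))

# A well-supported diagnosis should score higher than one resting on
# sparse, partly contradicted evidence.
well_supported = evidence_confidence(support=5, missing=0, contradict=0)
sparse = evidence_confidence(support=1, missing=3, contradict=1)
```

Under this toy formulation, confidence degrades smoothly as information becomes insufficient, which mirrors the paper's point that information completeness (not just answer accuracy) must enter the confidence estimate.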
Problem

Research questions and friction points this paper is trying to address.

medical LLMs
confidence estimation
multi-turn consultation
information sufficiency
diagnostic reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence estimation
medical LLMs
multi-turn consultation
retrieval-augmented generation
information sufficiency