Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals

📅 2026-01-23
🤖 AI Summary
Evaluations of medical imaging AI lack a systematic understanding of how confidence intervals (CIs) behave, which limits the reliability of reported results and their clinical translation. This study presents a large-scale empirical analysis of CI reliability (coverage) and precision (width) across 24 segmentation and classification tasks, using 19 trained models per task group, multiple performance metrics, several aggregation strategies, and widely adopted CI methods. The findings show that CI behavior depends substantially on the choice of performance metric, aggregation method, task type, and sample size: the sample size required for reliable CIs ranges from a few dozen to several thousand cases, and CI methods differ markedly from one another. This work provides evidence and practical guidance for reporting uncertainty in medical AI performance evaluations.

📝 Abstract
Performance uncertainty quantification is essential for reliable validation and eventual clinical translation of medical imaging artificial intelligence (AI). Confidence intervals (CIs) play a central role in this process by indicating how precise a reported performance estimate is. Yet, because little work has examined CI behavior in medical imaging, the community remains largely unaware of how many diverse CI methods exist and how they behave in specific settings. The purpose of this study is to close this gap. To this end, we conducted a large-scale empirical analysis across a total of 24 segmentation and classification tasks, using 19 trained models per task group, a broad spectrum of commonly used performance metrics, multiple aggregation strategies, and several widely adopted CI methods. Reliability (coverage) and precision (width) of each CI method were estimated across all settings to characterize their dependence on study characteristics. Our analysis revealed five principal findings: 1) the sample size required for reliable CIs varies from a few dozen to several thousand cases depending on study parameters; 2) CI behavior is strongly affected by the choice of performance metric; 3) aggregation strategy substantially influences the reliability of CIs, e.g., CIs require more observations for macro than for micro aggregation; 4) the machine learning problem (segmentation versus classification) modulates these effects; 5) CI methods differ in reliability and precision depending on the use case. These results form key components for the development of future guidelines on reporting performance uncertainty in medical imaging AI.
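To make the two quantities in the abstract concrete, the sketch below estimates coverage and width for one CI method by repeated sampling. It is only an illustration under stated assumptions, not the authors' protocol: the percentile bootstrap stands in for the unnamed CI methods, the metric is a simple mean of per-case scores, the "population" is synthetic Beta-distributed data standing in for Dice-like scores, and the function names `percentile_bootstrap_ci` and `estimate_coverage_and_width` are hypothetical.

```python
import numpy as np


def percentile_bootstrap_ci(scores, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the mean of per-case scores (assumed CI method)."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    # Resample case indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def estimate_coverage_and_width(population, n, n_trials=500, seed=0):
    """Draw many test sets of size n and measure how the CI behaves.

    Coverage: fraction of trials whose CI contains the population mean.
    Width: average of (upper bound - lower bound) over trials.
    """
    rng = np.random.default_rng(seed)
    population = np.asarray(population, dtype=float)
    true_mean = population.mean()
    hits = 0
    widths = []
    for _ in range(n_trials):
        # Each trial simulates one evaluation study with n test cases.
        sample = rng.choice(population, size=n, replace=False)
        lo, hi = percentile_bootstrap_ci(sample, rng=rng)
        hits += int(lo <= true_mean <= hi)
        widths.append(hi - lo)
    return hits / n_trials, float(np.mean(widths))


if __name__ == "__main__":
    # Synthetic, skewed scores in [0, 1]; an assumption, not data from the study.
    population = np.random.default_rng(1).beta(8, 2, size=20000)
    for n in (20, 200):
        coverage, width = estimate_coverage_and_width(population, n, n_trials=200)
        print(f"n={n}: coverage={coverage:.3f}, mean width={width:.3f}")
```

A nominal 95% CI is reliable when its estimated coverage is close to 0.95. Running the loop over several sample sizes shows how coverage and width change with `n`, which is the kind of dependence the study characterizes across metrics, aggregation strategies, and CI methods.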
Problem

Research questions and friction points this paper is trying to address.

performance uncertainty
confidence intervals
medical image analysis
AI validation
clinical translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

confidence intervals
medical image analysis
performance uncertainty
empirical evaluation
AI validation
Pascaline André
Sorbonne Université, Institut du Cerveau – Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris, France
Charles Heitz
Sorbonne Université, Institut du Cerveau – Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris, France
Evangelia Christodoulou
Unit for Lifelong Health and Ageing at UCL, Department of Population Science and Experimental Medicine and Hawkes Institute Centre for Medical Image Computing, Department of Computer Science, University College London, UK
Annika Reinke
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany
Carole H. Sudre
Unit for Lifelong Health and Ageing at UCL, Department of Population Science and Experimental Medicine and Hawkes Institute Centre for Medical Image Computing, Department of Computer Science, University College London, UK
Michela Antonelli
School of Biomedical Engineering and Imaging Science, King’s College London, UK
Patrick Godau
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany
M. Jorge Cardoso
Reader, King's College London
Medical Image Analysis, Artificial Intelligence, Machine Learning
Antoine Gilson
Sorbonne Université, Institut du Cerveau – Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris, France
Sophie Tezenas du Montcel
Sorbonne Université, Institut du Cerveau – Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, F-75013, Paris, France
Gaël Varoquaux
Research director, INRIA
Machine learning, tabular AI, neuroscience, medical statistics
Lena Maier-Hein
German Cancer Research Center (DKFZ) Heidelberg, Div. Intelligent Medical Systems, Germany
Olivier Colliot
Research Director at CNRS, ARAMIS Lab
machine learning, image analysis, medical imaging, neuroimaging, brain disorders