Sequential sample size calculations and learning curves safeguard the robust development of a clinical prediction model for individuals

📅 2025-09-18
🤖 AI Summary
In clinical prediction model (CPM) development, conventional fixed sample size calculations rest on a priori assumptions; when those assumptions are inaccurate, the calculated sample size can be too small, jeopardizing model stability and the reliability of individual-level predictions. To address this, we propose a sequential sample size approach grounded in learning curve analysis. The approach uses individual prediction uncertainty and classification instability as stopping criteria, alongside optimism-adjusted measures of calibration and discrimination, so that overfitting and predictive performance can be monitored as recruitment proceeds. It combines logistic regression, bootstrap resampling, and sequential learning curve evaluation. In an acute kidney injury illustration, the fixed calculation suggested recruiting 342 patients, whereas the sequential approach indicated that 1,100 were needed to minimize overfitting (population-level stability) and 1,800 to also achieve small uncertainty and misclassification probability for individual predictions.

📝 Abstract
When prospectively developing a new clinical prediction model (CPM), fixed sample size calculations are typically conducted before data collection, based on sensible assumptions. But if the assumptions are inaccurate, the actual sample size required to develop a reliable model may be very different. To safeguard against this, adaptive sample size approaches have been proposed, based on sequential evaluation of a model's predictive performance. Aim: to illustrate and extend sequential sample size calculations for CPM development by (i) proposing stopping rules based on minimising uncertainty (instability) and misclassification of individual-level predictions, and (ii) showcasing how the sequential approach safeguards against inaccurate fixed sample size calculations. The sequential approach repeats the pre-defined model development strategy every time a chosen number (e.g., 100) of participants are recruited and adequately followed up. At each stage, CPM performance is evaluated using bootstrapping, leading to prediction and classification stability statistics and plots, alongside optimism-adjusted measures of calibration and discrimination. Our approach is illustrated for development of a CPM for acute kidney injury using logistic regression. The fixed sample size calculation, based on perceived sensible assumptions, suggests recruiting 342 patients to minimise overfitting; however, the sequential approach reveals that a much larger sample size of 1100 is required to minimise overfitting (targeting population-level stability). If the stopping rule criteria also target small uncertainty and misclassification probability of individual predictions, the sequential approach suggests an even larger sample size (n=1800). Our sequential sample size approach allows users to dynamically monitor individual-level prediction and classification instability and safeguard against using inaccurate assumptions.
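The sequential procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are simulated from a hypothetical one-predictor logistic model, the instability statistic is a simplified stand-in (mean absolute difference between bootstrap-model and original-model predictions), and the function names, the threshold of 0.02, and the number of bootstrap samples are all illustrative choices.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=10):
    """Fit intercept and slope (one predictor) by Newton-Raphson."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0
        h00 = h11 = 1e-6  # tiny damping term, only for numerical stability
        h01 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def simulate(n, rng, intercept=-1.0, slope=1.0):
    """Hypothetical data-generating model, used only for illustration."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(intercept + slope * x) else 0 for x in xs]
    return xs, ys


def instability(xs, ys, rng, n_boot=20):
    """Mean absolute difference between bootstrap-model and original-model
    predictions, averaged over individuals and bootstrap samples."""
    a, b = fit_logistic(xs, ys)
    original = [sigmoid(a + b * x) for x in xs]
    n = len(xs)
    total = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ab, bb = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        total += sum(abs(sigmoid(ab + bb * x) - p) for x, p in zip(xs, original))
    return total / (n_boot * n)


def sequential_sample_size(xs, ys, rng, step=100, threshold=0.02, n_boot=20):
    """Refit after every `step` participants; stop once instability <= threshold.

    Returns (stopping n or None, learning curve as a list of (n, instability)).
    """
    curve = []
    for n in range(step, len(xs) + 1, step):
        value = instability(xs[:n], ys[:n], rng, n_boot)
        curve.append((n, value))
        if value <= threshold:
            return n, curve
    return None, curve
```

Calling `sequential_sample_size(*simulate(2000, rng), rng)` with `rng = random.Random(1)` returns the first stage at which the illustrative stopping rule is met, together with the learning curve of instability against sample size. A fuller version would add the optimism-adjusted calibration and discrimination measures that the abstract mentions at each stage.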
Problem

Research questions and friction points this paper is trying to address.

Dynamic sample size determination for clinical prediction models
Minimizing uncertainty and misclassification in individual predictions
Safeguarding against inaccurate fixed sample size assumptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential sample size calculations with adaptive stopping rules
Bootstrapping evaluates prediction stability and misclassification probability
Dynamic monitoring safeguards against inaccurate fixed sample size assumptions
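As a rough sketch of the individual-level statistics named in the list above, the snippet below computes, for one individual, the width of a percentile interval of bootstrap predictions and the proportion of bootstrap predictions that fall on the other side of a risk threshold from the original prediction. The function names, the 95% level, and the example threshold of 0.10 are assumptions for illustration, not values taken from the paper.

```python
import math


def misclassification_probability(original_pred, bootstrap_preds, threshold):
    """Share of bootstrap predictions classified differently from the original
    prediction at the given risk threshold (hypothetical helper)."""
    original_class = original_pred >= threshold
    flips = sum((p >= threshold) != original_class for p in bootstrap_preds)
    return flips / len(bootstrap_preds)


def uncertainty_interval_width(bootstrap_preds, level=0.95):
    """Width of a simple percentile interval of the bootstrap predictions."""
    ordered = sorted(bootstrap_preds)
    last = len(ordered) - 1
    lower = ordered[int(math.floor((1 - level) / 2 * last))]
    upper = ordered[int(math.ceil((1 + level) / 2 * last))]
    return upper - lower


# Example: original prediction 0.12, four bootstrap-model predictions,
# illustrative classification threshold 0.10.
example = misclassification_probability(0.12, [0.08, 0.11, 0.15, 0.09], 0.10)
```

In this example two of the four bootstrap predictions fall below the threshold while the original prediction lies above it, so `example` is 0.5. A stopping rule could then require that both statistics stay below chosen limits for most individuals before recruitment ends.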
Amardeep Legha
Department of Applied Health Sciences, School of Health Sciences, College of Medicine and Health, University of Birmingham, Birmingham, United Kingdom.
Joie Ensor
Department of Applied Health Sciences, School of Health Sciences, College of Medicine and Health, University of Birmingham, Birmingham, United Kingdom.
Rebecca Whittle
Department of Applied Health Sciences, School of Health Sciences, College of Medicine and Health, University of Birmingham, Birmingham, United Kingdom.
Lucinda Archer
Institute of Data and AI, University of Birmingham, United Kingdom.
Ben Van Calster
Professor of Medical Statistics, KU Leuven
Prediction modeling, biostatistics
Evangelia Christodoulou
German Cancer Research Center (DKFZ) Heidelberg, Division of Intelligent Medical Systems, Heidelberg, Germany
Kym I. E. Snell
Department of Applied Health Sciences, School of Health Sciences, College of Medicine and Health, University of Birmingham, Birmingham, United Kingdom.
Mohsen Sadatsafavi
Associate Professor, the University of British Columbia
Epidemiology, Biostatistics, Health Economics, Respiratory Diseases
Gary S. Collins
Professor of Medical Statistics, University of Birmingham
medical statistics, statistics, biostatistics, machine learning, metascience
Richard D. Riley
Department of Applied Health Sciences, School of Health Sciences, College of Medicine and Health, University of Birmingham, Birmingham, United Kingdom.