Compatibility of Missing Data Handling Methods across the Stages of Producing Clinical Prediction Models

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Clinical prediction models (CPMs) often employ inconsistent missing-data handling strategies across their lifecycle, in particular overlooking deployment-stage constraints on missingness, which leads to biased performance estimation. This study introduces the concept of "compatibility" to systematically evaluate consistency among missing-data methods across the development, validation, and deployment phases. Using simulation studies, real-world thoracic surgical data, and a comprehensive case study, the authors compare complete case analysis, mean imputation, single regression imputation, multiple imputation (MI), and pattern sub-modelling. Results indicate that when deployment prohibits missing values, MI should be used throughout all phases; when missingness is permitted at deployment, the same imputation method must be applied in both development and validation. Crucially, this work demonstrates how deployment constraints retroactively influence upstream modelling decisions. The findings provide methodological guidance for robust CPM development and clinical translation, emphasising phase-aligned missing-data strategies to ensure valid performance estimation and real-world applicability.

📝 Abstract
Missing data is a challenge when developing, validating and deploying clinical prediction models (CPMs). Traditionally, decisions concerning missing data handling during CPM development and validation haven't accounted for whether missingness is allowed at deployment. We hypothesised that the missing data approach used during model development should optimise model performance upon deployment, whilst the approach used during model validation should yield unbiased predictive performance estimates upon deployment; we term this compatibility. We aimed to determine which combinations of missing data handling methods across the CPM life cycle are compatible. We considered scenarios where CPMs are intended to be deployed with missing data allowed or not, and we evaluated the impact of that choice on earlier modelling decisions. Through a simulation study and an empirical analysis of thoracic surgery data, we compared CPMs developed and validated using combinations of complete case analysis, mean imputation, single regression imputation, multiple imputation, and pattern sub-modelling. If planning to deploy a CPM without allowing missing data, then development and validation should use multiple imputation when required. Where missingness is allowed at deployment, the same imputation method must be used during development and validation. Commonly used combinations of missing data handling methods result in biased predictive performance estimates.
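As an illustrative sketch only (the toy data, variable names, and code are not from the paper), the contrast between mean imputation and single regression imputation, and the compatibility idea that imputation parameters estimated at development are carried forward to deployment rather than re-estimated, might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical development data: predictor x2 depends on x1,
# and roughly 30% of x2 values are missing.
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.3
x2_obs = np.where(missing, np.nan, x2)

# Mean imputation: fill missing x2 with the development-set mean.
mean_x2 = np.nanmean(x2_obs)
x2_mean_imp = np.where(np.isnan(x2_obs), mean_x2, x2_obs)

# Single regression imputation: predict missing x2 from x1 using a
# regression fitted on the complete cases only.
cc = ~np.isnan(x2_obs)
slope, intercept = np.polyfit(x1[cc], x2_obs[cc], 1)
x2_reg_imp = np.where(np.isnan(x2_obs), slope * x1 + intercept, x2_obs)

# Compatibility in miniature: the imputation parameters estimated at
# development (mean_x2, slope, intercept) are reused unchanged when a
# new patient arrives at deployment with x2 missing, rather than being
# re-estimated on deployment data.
new_patient_x1 = 1.2
x2_at_deploy = slope * new_patient_x1 + intercept
```

This sketch covers only the single-imputation methods; multiple imputation and pattern sub-modelling, which the paper also evaluates, involve repeating the imputation step or fitting separate models per missingness pattern and are not shown here.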
Problem

Research questions and friction points this paper is trying to address.

Compatibility of missing data methods in clinical prediction models
Impact of deployment missingness on development and validation choices
Optimal missing data handling for unbiased performance estimates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multiple imputation throughout when deployment disallows missing data
Same imputation method at development and validation when deployment allows missingness
Compatibility framework for ensuring unbiased performance estimates
A
Antonia Tsvetanova
Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
M
M. Sperrin
Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
D
David A. Jenkins
Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
Niels Peek
The Healthcare Improvement Studies Institute, University of Cambridge
data science, healthcare improvement, health informatics, artificial intelligence
I
Iain E. Buchan
Civic Health Innovation Labs, The University of Liverpool, Liverpool, UK
S
Stephanie Hyland
Microsoft Research, Cambridge, UK
M
Marcus Taylor
Department of Cardiothoracic Surgery, Manchester University Hospital NHS Foundation Trust, Manchester, UK
Angela Wood
University of Cambridge
Statistics, Biostatistics, Epidemiology, Health Data Science
Richard D Riley
University of Birmingham, UK
Meta-analysis, prognosis research, risk prediction
G
Glen P. Martin
Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom