🤖 AI Summary
Clinical machine learning models often generalize poorly across multi-center settings, owing to the scarcity of real-world data, confounding bias, and the absence of controllable modeling of distributional shifts such as site-specific variation and differences in subgroup prevalence. To address this, we propose a structured synthetic data framework explicitly designed for validating clinical generalization. Grounded in structural causal models and a hierarchical generative mechanism, it enables explicit, decoupled control over site-specific priors, targeted bias injection, and interpretable feature interactions. The framework supports systematic benchmarking of model robustness, fairness, and generalization: it isolates site-level variation, enables fairness auditing, and reveals interaction-driven failure modes between model complexity and site effects. In controlled experiments across clinical prediction tasks, it improves the reliability and traceability of generalization attribution.
📝 Abstract
Ensuring the generalisability of clinical machine learning (ML) models across diverse healthcare settings remains a significant challenge due to variability in patient demographics, disease prevalence, and institutional practices. Existing model evaluation approaches often rely on real-world datasets, which are limited in availability, embed confounding biases, and lack the flexibility needed for systematic experimentation. Furthermore, while generative models aim for statistical realism, they often lack transparency and explicit control over the factors driving distributional shifts. In this work, we propose a novel structured synthetic data framework designed for the controlled benchmarking of model robustness, fairness, and generalisability. Unlike approaches focused solely on mimicking observed data, our framework provides explicit control over the data-generating process, including site-specific prevalence variations, hierarchical subgroup effects, and structured feature interactions. This enables targeted investigation into how models respond to specific distributional shifts and potential biases. Through controlled experiments, we demonstrate the framework's ability to isolate the impact of site variations, support fairness-aware audits, and reveal generalisation failures, particularly highlighting how model complexity interacts with site-specific effects. This work contributes a reproducible, interpretable, and configurable tool designed to advance the reliable deployment of ML in clinical settings.
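To make the idea of a controllable data-generating process concrete, the sketch below shows a minimal hierarchical generator in the spirit the abstract describes: a site-level intercept shifts outcome prevalence, a subgroup effect alters the feature distribution, and an explicit feature interaction enters the outcome model. All function names, coefficients, and functional forms here are illustrative assumptions, not the paper's actual specification.

```python
import numpy as np

def generate_site_data(n_patients, base_prevalence, site_shift, rng):
    """Simulate one site's cohort with a controllable site-level prevalence
    shift, a subgroup effect, and a feature-outcome interaction.
    Illustrative only; parameters are placeholders, not the paper's values."""
    # Subgroup membership (e.g. a demographic attribute); its mix varies by site.
    subgroup = rng.binomial(1, 0.3 + 0.2 * site_shift, size=n_patients)
    # Two continuous clinical features; x2's mean depends on subgroup (hierarchy).
    x1 = rng.normal(0.0, 1.0, size=n_patients)
    x2 = rng.normal(0.5 * subgroup, 1.0, size=n_patients)
    # Linear predictor: a site-specific intercept shift moves prevalence,
    # and an explicit x1*x2 term encodes a structured interaction.
    logit = (np.log(base_prevalence / (1 - base_prevalence))
             + site_shift           # site-specific prior
             + 0.8 * x1 + 0.5 * x2
             + 0.6 * x1 * x2        # structured feature interaction
             + 0.4 * subgroup)      # subgroup effect
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return np.column_stack([x1, x2, subgroup]), y

rng = np.random.default_rng(0)
X_a, y_a = generate_site_data(5000, 0.10, site_shift=0.0, rng=rng)
X_b, y_b = generate_site_data(5000, 0.10, site_shift=1.0, rng=rng)
print(y_a.mean(), y_b.mean())  # site B's outcome prevalence is shifted upward
```

Because each mechanism enters through a separate, named parameter, a benchmark can vary one factor at a time (e.g. only `site_shift`) and attribute any drop in a model's cross-site performance to that factor, which is the kind of decoupled control the framework argues real-world datasets cannot provide.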