🤖 AI Summary
This study addresses trustworthy model evaluation on ICU time-series data under privacy constraints, both at the population level and within fine-grained subgroups (e.g., age × sex × race intersections). We propose Enhanced TimeAutoDiff, a framework that integrates latent-space diffusion modeling with distribution-alignment regularization in a unified VAE-diffusion architecture, jointly optimizing synthetic data fidelity and statistical representativeness. Evaluated on MIMIC-III and eICU, the method reduces the TRTS (Train-on-Real, Test-on-Synthetic) performance gap by over 70% and decreases subgroup AUROC estimation error by up to 50%. Moreover, in 72%–84% of the 32 subgroups, evaluation on large synthetic cohorts is more accurate than evaluation on scarce real-data samples. To our knowledge, this is the first approach enabling high-fidelity, privacy-preserving, and subgroup-generalizable evaluation for critical care AI models.
📝 Abstract
We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce \textit{Enhanced TimeAutoDiff}, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (``TRTS gap'') by over 70%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72--84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
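To make the evaluation quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the TRTS gap is the absolute difference between the AUROC that a real-trained model obtains on a synthetic test set and on a real test set. The function names (`auroc`, `trts_gap`, `subgroup_auroc`) and the toy labels, scores, and group keys are hypothetical and purely illustrative.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the normalized Mann-Whitney U statistic; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def trts_gap(real_labels, real_scores, synth_labels, synth_scores):
    """Assumed definition: |AUROC on synthetic test - AUROC on real test|
    for the same model trained on real data."""
    return abs(auroc(synth_labels, synth_scores) - auroc(real_labels, real_scores))


def subgroup_auroc(labels, scores, groups):
    """AUROC per subgroup key; subgroups containing only one class are skipped."""
    result = {}
    for g in set(groups):
        idx = [i for i, k in enumerate(groups) if k == g]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < len(ys):
            result[g] = auroc(ys, [scores[i] for i in idx])
    return result


# Toy data (hypothetical): scores from one real-trained model on two test sets.
real_labels = [1, 0, 1, 0]
real_scores = [0.9, 0.2, 0.4, 0.6]
synth_labels = [1, 0, 1, 0]
synth_scores = [0.8, 0.1, 0.7, 0.3]
groups = ["a", "a", "b", "b"]  # stand-ins for intersectional subgroup keys

gap = trts_gap(real_labels, real_scores, synth_labels, synth_scores)
per_group = subgroup_auroc(real_labels, real_scores, groups)
```

With only two patients per subgroup, each per-subgroup AUROC in this toy example can only be 0, 0.5, or 1, which illustrates why small real test sets give noisy subgroup estimates and why larger cohorts are attractive for this kind of analysis.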