Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series

📅 2025-10-22
🤖 AI Summary
This study addresses the challenge of trustworthy model evaluation under privacy constraints on ICU time-series data, both at the population level and within fine-grained subgroups (e.g., age × sex × race intersections). We propose Enhanced TimeAutoDiff, a framework that combines latent-space diffusion modeling with distribution-alignment regularization in a unified VAE-diffusion architecture, jointly optimizing synthetic data fidelity and statistical representativeness. Evaluated on MIMIC-III and eICU, the method reduces the TRTS (Train-on-Real, Test-on-Synthetic) performance gap by over 70% and decreases subgroup AUROC estimation error by up to 50%. Moreover, evaluation on large synthetic cohorts outperforms evaluation on scarce real test samples in 72%–84% of the 32 subgroups. To our knowledge, this is the first approach enabling high-fidelity, privacy-preserving, and subgroup-generalizable evaluation for critical care AI models.

📝 Abstract
We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce Enhanced TimeAutoDiff, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation ("TRTS gap") by over 70%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72–84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
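The two gaps quoted in the abstract can be made concrete with a short sketch. The Python below is an illustration, not the authors' code. It assumes each gap is the absolute AUROC difference from a train-on-real, test-on-real (TRTR) baseline; the helper names `auroc` and `evaluation_gaps` and the example AUROC values are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as one half (pairwise definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluation_gaps(auroc_trtr, auroc_trts, auroc_tstr):
    """Gaps relative to the real-on-real baseline.

    Assumption: both gaps are absolute AUROC differences.
    - delta_trts: train on real, test on synthetic vs. test on real
    - delta_tstr: train on synthetic, test on real vs. train on real
    """
    return {
        "delta_trts": abs(auroc_trts - auroc_trtr),
        "delta_tstr": abs(auroc_tstr - auroc_trtr),
    }


# Toy check of the AUROC helper on four labelled scores.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Hypothetical AUROC values, chosen only to show the call.
print(evaluation_gaps(auroc_trtr=0.85, auroc_trts=0.84, auroc_tstr=0.86))
```

A small `delta_trts` means a real-trained model scores about the same on synthetic test data as on real test data, which is the property that makes synthetic cohorts usable as evaluation sets.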
Problem

Research questions and friction points this paper is trying to address.

Enabling granular subgroup-level evaluation of medical predictive models
Generating synthetic ICU time-series data for privacy-preserving model assessment
Reducing performance estimation gaps between synthetic and real data evaluations
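Why a large synthetic cohort might estimate subgroup AUROC better than a small real test set can be illustrated with a toy simulation. Everything in this sketch is an assumption made for illustration: model scores are Gaussian, the "real" subgroup has only 40 patients, and the synthetic generator is imperfect in a specific hypothetical way (positives are shifted by 0.05). It does not reproduce the paper's experiments.

```python
import math
import random


def auroc(labels, scores):
    """Rank-based (Mann-Whitney) AUROC, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def draw_cohort(rng, n_pos, n_neg, separation):
    """Scores: negatives ~ N(0, 1), positives ~ N(separation, 1)."""
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(separation, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores


def mean_abs_error(rng, n_pos, n_neg, separation, true_auroc, repeats):
    """Average |estimated AUROC - true AUROC| over repeated cohorts."""
    total = 0.0
    for _ in range(repeats):
        labels, scores = draw_cohort(rng, n_pos, n_neg, separation)
        total += abs(auroc(labels, scores) - true_auroc)
    return total / repeats


SEPARATION = 1.0  # hypothetical score separation in the real subgroup
# Closed form for two unit-variance Gaussians: AUROC = Phi(separation / sqrt(2)).
TRUE_AUROC = 0.5 * (1.0 + math.erf(SEPARATION / 2.0))

rng = random.Random(0)
# Small real test set: unbiased but noisy.
err_small_real = mean_abs_error(rng, 10, 30, SEPARATION, TRUE_AUROC, 100)
# Large synthetic cohort: slightly biased (hypothetical shift) but low variance.
err_large_synth = mean_abs_error(rng, 500, 1500, SEPARATION + 0.05, TRUE_AUROC, 100)

print(f"small real set:   mean abs error = {err_small_real:.4f}")
print(f"large synthetic:  mean abs error = {err_large_synth:.4f}")
```

The sketch captures a bias-variance trade-off: the synthetic estimate carries a small systematic error from the generator, but the small real subgroup carries a much larger sampling error. If the generator's shift were large, the comparison would reverse, which is why a low TRTS gap matters for this use.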
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced TimeAutoDiff adds distribution-alignment penalties to the latent diffusion objective
Generates synthetic ICU time-series for subgroup model evaluation
Reduces the TRTS gap by over 70% while preserving training utility
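The summary does not say which distribution-alignment penalties are used. As one possible reading, the sketch below adds a simple moment-matching term (per-feature mean and variance differences between a real and a synthetic batch) to a base loss. The penalty form, the function names, and the `weight` default are all assumptions made for illustration and should not be taken as the paper's objective.

```python
from statistics import fmean, pvariance


def moment_alignment_penalty(real_batch, synth_batch):
    """Sum over features of squared differences in mean and variance.

    Both batches are lists of rows with the same number of features.
    Assumption: moment matching stands in for the unspecified penalty.
    """
    penalty = 0.0
    # zip(*batch) transposes rows into per-feature columns.
    for real_col, synth_col in zip(zip(*real_batch), zip(*synth_batch)):
        penalty += (fmean(real_col) - fmean(synth_col)) ** 2
        penalty += (pvariance(real_col) - pvariance(synth_col)) ** 2
    return penalty


def total_loss(base_loss, real_batch, synth_batch, weight=0.1):
    """Base generative loss plus a weighted alignment penalty.

    `weight` is a hypothetical hyperparameter trading fidelity of
    individual samples against agreement of batch-level statistics.
    """
    return base_loss + weight * moment_alignment_penalty(real_batch, synth_batch)


# Toy batches with two features; the first feature's mean is off by 1.
real = [[0.0, 0.0], [2.0, 2.0]]
synth = [[1.0, 0.0], [3.0, 2.0]]
print(moment_alignment_penalty(real, synth))
print(total_loss(0.5, real, synth, weight=0.1))
```

In a real training loop the penalty would be computed on differentiable tensors so its gradient reaches the generator; plain Python lists are used here only to keep the arithmetic visible.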
Mahmoud Ibrahim
VITO, Maastricht University
Generative AI, Medical AI, Trustworthy AI, Synthetic Data
Bart Elen
VITO, Belgium
Chang Sun
Institute of Data Science, Faculty of Science and Engineering, Maastricht University, Maastricht, The Netherlands
Gökhan Ertaylan
VITO, Belgium
Michel Dumontier
Distinguished Professor of Data Science, Maastricht University
data science, artificial intelligence, biomedical informatics, semantic web, ontology