🤖 AI Summary
This work addresses the challenge of generating high-quality synthetic electronic health records (EHRs) in privacy-sensitive, cross-hospital settings where sharing real EHR data is infeasible. Directly federated generative modeling struggles with the high dimensionality and sparsity of EHRs and with inter-institutional data heterogeneity, and training often collapses or diverges. To overcome these limitations, the paper proposes FedEHR-Gen, presented as the first federated framework for synthetic time-series EHR generation across distributed hospitals. The method follows a two-stage paradigm: first, a federated autoencoder maps EHRs into a compact latent space, using a layer-wise matching aggregation mechanism to align local encoders across sites into a unified global latent space; then, a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation is trained in this aligned latent space to generate temporal sequences stably. Experiments on eICU and MIMIC-III show that FedEHR-Gen achieves generation fidelity, downstream-task utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.
📝 Abstract
Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.
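The abstract does not spell out how layer-wise matching aggregation works, so the following is only a minimal sketch of the general idea of matching hidden units before averaging, not the authors' algorithm. It assumes a single hidden layer per client, uses the first client as an arbitrary reference, and finds the best unit permutation by brute force, which is feasible only for tiny layers (a practical version would use an assignment solver such as the Hungarian algorithm). All function names here are hypothetical.

```python
from itertools import permutations


def sq_dist(u, v):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_permutation(ref_rows, local_rows):
    """Return perm so that local_rows[perm[i]] is matched to ref_rows[i].

    Brute force over all permutations: an assumption made for clarity,
    usable only when the layer has a handful of hidden units.
    """
    n = len(ref_rows)
    return min(
        permutations(range(n)),
        key=lambda p: sum(sq_dist(ref_rows[i], local_rows[p[i]]) for i in range(n)),
    )


def align_layer(ref_w_in, w_in, w_out):
    """Reorder a client's hidden units to line up with the reference.

    w_in is hidden x input (one row per hidden unit);
    w_out is output x hidden (one column per hidden unit).
    """
    perm = best_permutation(ref_w_in, w_in)
    aligned_in = [w_in[j] for j in perm]
    aligned_out = [[row[j] for j in perm] for row in w_out]
    return aligned_in, aligned_out


def average(mats, weights):
    """Weighted element-wise average of equally shaped matrices."""
    total = sum(weights)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [
        [sum(w * m[r][c] for m, w in zip(mats, weights)) / total for c in range(cols)]
        for r in range(rows)
    ]


def matched_aggregate(clients, weights):
    """Align every client to the first one, then average layer by layer.

    clients is a list of (w_in, w_out) pairs; weights could be, for
    example, per-hospital sample counts (an assumption, not from the paper).
    """
    ref_in = clients[0][0]
    aligned = [align_layer(ref_in, w_in, w_out) for w_in, w_out in clients]
    return (
        average([a[0] for a in aligned], weights),
        average([a[1] for a in aligned], weights),
    )


# Two hypothetical hospitals whose encoders compute the same function
# but store their hidden units in a different order.
hospital_a = ([[1.0, 0.0], [0.0, 1.0]], [[2.0, 3.0]])
hospital_b = ([[0.0, 1.0], [1.0, 0.0]], [[3.0, 2.0]])

global_in, global_out = matched_aggregate([hospital_a, hospital_b], [1, 1])
naive_in = average([hospital_a[0], hospital_b[0]], [1, 1])
```

In this toy case, plain averaging blends the two differently ordered encoders into a layer whose hidden units are identical, whereas matching first recovers the shared encoder. That illustrates why some form of alignment is needed before aggregation when local latent spaces are not semantically consistent.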