FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of generating high-quality synthetic electronic health records (EHRs) in privacy-sensitive, cross-hospital settings where real EHR data cannot be pooled. Direct federated generative modeling often collapses or diverges because EHR data are high-dimensional, sparse, and heterogeneous across hospitals. To overcome these limitations, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. The method uses a two-stage paradigm. First, a federated autoencoder maps EHRs into a compact latent space, with a layer-wise matching aggregation mechanism that aligns local encoders across sites. Second, a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation is trained in this aligned latent space to generate sequences stably under cross-hospital heterogeneity. Experiments on eICU and MIMIC-III show that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.

📝 Abstract

Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.
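The abstract does not spell out how the layer-wise matching aggregation works. As a rough illustration of the general idea only, the sketch below aligns the hidden units of one encoder layer across clients before averaging, so that units with the same role are averaged together. Everything here is an assumption made for illustration rather than the paper's algorithm: the function names `best_permutation` and `match_and_average`, the squared-distance matching cost, the choice of the first client as reference, and the brute-force search (which is only feasible for tiny layers; a real system would more likely use an assignment solver such as `scipy.optimize.linear_sum_assignment`).

```python
import itertools

import numpy as np


def best_permutation(cost):
    """Return the permutation p minimizing sum_i cost[i, p[i]].

    Brute force over all permutations: O(n!), for illustration only.
    """
    n = cost.shape[0]
    rows = np.arange(n)
    best = min(
        itertools.permutations(range(n)),
        key=lambda p: cost[rows, list(p)].sum(),
    )
    return np.array(best)


def match_and_average(client_weights):
    """Align each client's hidden units to the first client, then average.

    client_weights: list of arrays, each of shape (hidden_units, in_features),
    holding the same encoder layer as trained at different sites (hypothetical
    setup; the paper's exact procedure is not given in the abstract).
    """
    reference = client_weights[0]
    aligned = [reference]
    for w in client_weights[1:]:
        # cost[i, j] = squared distance between reference unit i and client unit j
        cost = ((reference[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
        perm = best_permutation(cost)
        # Reorder the client's rows so that row i corresponds to reference unit i
        aligned.append(w[perm])
    return np.mean(aligned, axis=0)
```

If two sites learn the same units in a different order, plain coordinate-wise averaging mixes unrelated units, whereas averaging after matching recovers the shared layer. In a multi-layer encoder, the permutation applied to one layer's output units would also have to be applied to the next layer's input columns to keep the network's function unchanged.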

Problem

Research questions and friction points this paper is trying to address.

Federated Learning

Synthetic EHR Generation

Time-Series Data

Data Heterogeneity

Privacy-Preserving

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning

Synthetic EHR Generation

Latent Space Alignment