Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing synthetic electronic health record (EHR) generation methods lack systematic, multidimensional evaluation. Method: We introduce a comprehensive benchmark framework assessing synthetic EHRs across four dimensions (data fidelity, downstream task utility, privacy preservation, and computational cost) and uniformly evaluate seven representative open-source approaches, including GAN-based and rule-based methods, on MIMIC-III and MIMIC-IV. We also propose a decision tree to guide method selection in practice. Contribution/Results: The analysis shows that GAN-based methods (e.g., MedGAN, CorGAN) deliver competitive fidelity and utility, including under distributional shift, whereas rule-based methods offer the strongest privacy protection; CorGAN is best suited to association modeling and MedGAN to predictive modeling. We open-source SynthEHRella, a unified toolkit enabling streamlined reproduction and cross-method comparison, to advance reproducible, deployable synthetic EHR research and practice.

📝 Abstract
We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, "SynthEHRella", is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We find that method choice is governed by the relative importance of the evaluation metrics in downstream use cases, and we provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing the fidelity of synthetic data while controlling privacy exposure, and comprehensively benchmarking longitudinal or conditional generation methods.
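The abstract's selection guidance can be sketched as a small helper function. This is an illustrative encoding of the stated recommendations only; the function name and inputs are assumptions and are not part of the paper's SynthEHRella API:

```python
def recommend_method(privacy_critical: bool,
                     distribution_shift: bool,
                     task: str) -> str:
    """Sketch of the abstract's decision-tree guidance (illustrative only).

    Per the abstract: rule-based methods excel at privacy protection;
    GAN-based methods excel under train/test distributional shift;
    otherwise CorGAN suits association modeling and MedGAN suits
    predictive modeling.
    """
    if privacy_critical:
        return "rule-based"
    if distribution_shift:
        return "GAN-based (e.g., MedGAN or CorGAN)"
    if task == "association":
        return "CorGAN"
    if task == "prediction":
        return "MedGAN"
    return "benchmark several candidates"
```

For example, `recommend_method(False, False, "prediction")` returns `"MedGAN"`, matching the abstract's recommendation for predictive modeling.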
Problem

Research questions and friction points this paper is trying to address.

Review and benchmark synthetic EHR data generation methods
Evaluate methods on fidelity, utility, privacy, and cost
Provide open-source tools for method comparison and selection
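Of these evaluation axes, fidelity for binary code matrices is commonly checked with dimension-wise probability: comparing each code's prevalence in the real versus synthetic cohorts. A minimal NumPy sketch of that idea (the benchmark's exact metrics may differ):

```python
import numpy as np

def dimension_wise_probability(real: np.ndarray, synth: np.ndarray):
    """Per-code prevalence in real vs. synthetic patient-by-code binary matrices.

    Returns both prevalence vectors and their mean absolute difference;
    a small difference suggests good marginal (dimension-wise) fidelity.
    """
    p_real = real.mean(axis=0)    # fraction of patients carrying each code
    p_synth = synth.mean(axis=0)
    return p_real, p_synth, float(np.abs(p_real - p_synth).mean())

# Toy cohorts drawn from the same Bernoulli(0.3) model, so fidelity is high.
rng = np.random.default_rng(0)
real = (rng.random((500, 20)) < 0.3).astype(int)
synth = (rng.random((500, 20)) < 0.3).astype(int)
p_r, p_s, mad = dimension_wise_probability(real, synth)
```

With matched generating distributions the mean absolute prevalence gap stays near zero; a synthetic cohort that drops or inflates codes would push it up.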
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking GAN-based and rule-based EHR methods
Open-source Python package for synthetic EHR
Decision tree guides method selection
Xingran Chen
Department of Biostatistics, University of Michigan
Zhenke Wu
Associate Professor of Biostatistics (with tenure), University of Michigan
Statistics, Causality, Digital Health, Precision Health, Trustworthy AI
Xu Shi
University of Michigan
Electronic Health Record, Causal Inference, Negative Control, Machine Translation
Hyunghoon Cho
Department of Biomedical Informatics and Data Science, Yale University
B. Mukherjee
Department of Biostatistics, Yale University