Generating Synthetic Electronic Health Record (EHR) Data: A Review with Benchmarking

📅 2024-11-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing synthetic electronic health record (EHR) generation methods lack systematic, multidimensional evaluation. Method: We introduce a comprehensive benchmark framework assessing synthetic EHRs across four dimensions (data fidelity, downstream task utility, privacy preservation, and computational cost) and uniformly evaluate seven representative open-source approaches, including GAN-based and rule-based methods, on MIMIC-III and MIMIC-IV. We also propose a decision tree to guide method selection in practice. Contribution/Results: The analysis shows that GAN-based methods (e.g., MedGAN, CorGAN) deliver competitive fidelity and utility, including under distributional shift, whereas rule-based methods offer the strongest privacy protection; CorGAN is best suited to association modeling and MedGAN to predictive modeling. We open-source SynthEHRella, a unified toolkit enabling streamlined reproduction and cross-method comparison, to advance reproducible, deployable synthetic EHR research and practice.

📝 Abstract
We conduct a scoping review of existing approaches for synthetic EHR data generation, and benchmark major methods with proposed open-source software to offer recommendations for practitioners. We search three academic databases for our scoping review. Methods are benchmarked on open-source EHR datasets, MIMIC-III/IV. Seven existing methods covering major categories and two baseline methods are implemented and compared. Evaluation metrics concern data fidelity, downstream utility, privacy protection, and computational cost. 42 studies are identified and classified into five categories. Seven open-source methods covering all categories are selected, trained on MIMIC-III, and evaluated on MIMIC-III or MIMIC-IV for transportability considerations. Among them, GAN-based methods demonstrate competitive performance in fidelity and utility on MIMIC-III; rule-based methods excel in privacy protection. Similar findings are observed on MIMIC-IV, except that GAN-based methods further outperform the baseline methods in preserving fidelity. A Python package, "SynthEHRella", is provided to integrate various choices of approaches and evaluation metrics, enabling more streamlined exploration and evaluation of multiple methods. We find that method choice is governed by the relative importance of the evaluation metrics in downstream use cases, and we provide a decision tree to guide the choice among the benchmarked methods. Based on the decision tree, GAN-based methods excel when distributional shifts exist between the training and testing populations. Otherwise, CorGAN and MedGAN are most suitable for association modeling and predictive modeling, respectively. Future research should prioritize enhancing the fidelity of synthetic data while controlling privacy exposure, and comprehensively benchmarking longitudinal or conditional generation methods.
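The abstract's selection guidance can be sketched as a small helper function. This is an illustrative encoding of the stated recommendations only; the function name and inputs are assumptions and are not part of the paper's SynthEHRella API:

```python
def recommend_method(privacy_critical: bool,
                     distribution_shift: bool,
                     task: str) -> str:
    """Sketch of the abstract's decision-tree guidance (illustrative only).

    Per the abstract: rule-based methods excel at privacy protection;
    GAN-based methods excel under train/test distributional shift;
    otherwise CorGAN suits association modeling and MedGAN suits
    predictive modeling.
    """
    if privacy_critical:
        return "rule-based"
    if distribution_shift:
        return "GAN-based (e.g., MedGAN or CorGAN)"
    if task == "association":
        return "CorGAN"
    if task == "prediction":
        return "MedGAN"
    return "benchmark several candidates"
```

For example, `recommend_method(False, False, "prediction")` returns `"MedGAN"`, matching the abstract's recommendation for predictive modeling.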
Problem

Research questions and friction points this paper is trying to address.

Review and benchmark synthetic EHR data generation methods
Evaluate methods on fidelity, utility, privacy, and cost
Provide open-source tools for method comparison and selection
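Of these evaluation axes, fidelity for binary code matrices is commonly checked with dimension-wise probability: comparing each code's prevalence in the real versus synthetic cohorts. A minimal NumPy sketch of that idea (the benchmark's exact metrics may differ):

```python
import numpy as np

def dimension_wise_probability(real: np.ndarray, synth: np.ndarray):
    """Per-code prevalence in real vs. synthetic patient-by-code binary matrices.

    Returns both prevalence vectors and their mean absolute difference;
    a small difference suggests good marginal (dimension-wise) fidelity.
    """
    p_real = real.mean(axis=0)    # fraction of patients carrying each code
    p_synth = synth.mean(axis=0)
    return p_real, p_synth, float(np.abs(p_real - p_synth).mean())

# Toy cohorts drawn from the same Bernoulli(0.3) model, so fidelity is high.
rng = np.random.default_rng(0)
real = (rng.random((500, 20)) < 0.3).astype(int)
synth = (rng.random((500, 20)) < 0.3).astype(int)
p_r, p_s, mad = dimension_wise_probability(real, synth)
```

With matched generating distributions the mean absolute prevalence gap stays near zero; a synthetic cohort that drops or inflates codes would push it up.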
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking GAN-based and rule-based EHR methods
Open-source Python package for synthetic EHR
Decision tree guides method selection
Xingran Chen
Department of Biostatistics, University of Michigan
Zhenke Wu
Associate Professor of Biostatistics (with tenure), University of Michigan
Statistics, Causality, Digital Health, Precision Health, Trustworthy AI
Xu Shi
University of Michigan
Electronic Health Record, Causal Inference, Negative Control, Machine Translation
Hyunghoon Cho
Department of Biomedical Informatics and Data Science, Yale University
B. Mukherjee
Department of Biostatistics, Yale University