🤖 AI Summary
To address the challenge of sharing real-world electronic health records (EHRs) under privacy and regulatory constraints, this paper proposes RawMed, the first end-to-end synthesis framework tailored to multi-table, longitudinal EHR data. RawMed combines latent-space modeling, text-based data representation, and lightweight compression to preserve the original EHR’s multi-table structure, temporal dynamics, and cross-table relationships with minimal preprocessing. To assess synthetic data quality, the paper introduces an evaluation framework covering distributional fidelity, temporal consistency, cross-table relational integrity, and privacy preservation. Experiments on two open-source EHR datasets show that RawMed outperforms existing baselines in both statistical fidelity and downstream task utility (e.g., prediction and cohort analysis). The implementation is publicly available, which supports reproducibility and use in clinical research.
📝 Abstract
Electronic Health Records (EHRs) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g., a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.
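
To make the idea of a text-based representation concrete, the sketch below shows one way rows from several EHR tables could be serialized into a single time-ordered sequence of text events. It is an illustration only, not RawMed's actual pipeline: the function names `textualize_event` and `patient_sequence`, the "table column value" layout, and the sample table names and values are all assumptions made for this example. The compression and generation steps are not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical helper: the "table col value col value ..." layout is an
# assumption for illustration, not the format used by RawMed.
def textualize_event(table: str, row: Dict[str, object]) -> str:
    """Serialize one table row as text, skipping empty fields."""
    parts = [table]
    for col, val in row.items():
        if val is None or val == "":
            continue  # drop missing values instead of imputing them
        parts.append(f"{col} {val}")
    return " ".join(parts)

# Hypothetical helper: merges rows from different tables by timestamp.
def patient_sequence(
    events: List[Tuple[float, str, Dict[str, object]]]
) -> List[Tuple[float, str]]:
    """Return one time-ordered list of (time, text) pairs for a patient."""
    ordered = sorted(events, key=lambda e: e[0])
    return [(t, textualize_event(table, row)) for t, table, row in ordered]

# Made-up sample rows: (hours since admission, table name, column values).
events = [
    (5.0, "labevents", {"itemid": 50912, "value": 1.2, "valueuom": "mg/dL"}),
    (1.5, "inputevents", {"itemid": 225158, "amount": 500, "amountuom": "ml"}),
]
sequence = patient_sequence(events)
for time, text in sequence:
    print(time, text)
```

Keeping column names and raw values in the text means no expert has to choose features or define per-column encodings up front, which is the sense in which such a representation needs little preprocessing.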