Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI

📅 2025-08-19
🤖 AI Summary
Problem: Existing synthetic data generation methods for epidemiological studies suffer from limitations in data fidelity, computational efficiency, and practical usability. Method: We propose an adversarial random forest (ARF)-based framework for tabular synthetic data generation, integrating dimensionality reduction, pre-derived variable construction, and multi-cohort harmonization to substantially reduce computational overhead and improve deployability. Contribution/Results: Evaluated across six real-world epidemiological studies, the synthetic data achieve high statistical fidelity, matching the original data in descriptive statistics and inferential analyses (e.g., effect estimates, confidence intervals, significance testing) with mean absolute error <5%. Performance remains robust even under small-sample conditions. This work is the first to directly apply ARF to epidemiological data synthesis and to conduct end-to-end evaluation of statistical utility, establishing a new paradigm for privacy-preserving, reproducible epidemiological research.

📝 Abstract
Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research. We propose the use of adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications and compared original with synthetic results. These publications cover blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, based on data from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. Additionally, we assessed the impact of dimensionality and variable complexity on synthesis quality by limiting datasets to variables relevant for individual analyses, including necessary derivations. Across all replicated original studies, results from multiple synthetic data replications consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, the replication outcomes closely matched the original results across various descriptive and inferential analyses. Reducing dimensionality and pre-deriving variables further enhanced both quality and stability of the results.
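The replication-style evaluation described in the abstract (fit the same analysis on original and on several synthetic datasets, then compare the results) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the variables `age`, `bmi`, and `sbp` are simulated, the regression is a plain least-squares fit, and `gaussian_synthesizer` is a simple stand-in generator, not ARF and not the authors' pipeline.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit with intercept; returns coefficients and standard errors."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (len(y) - A.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, se

def gaussian_synthesizer(data, n, rng):
    """Stand-in generator (NOT ARF): samples from a multivariate normal fitted to the data."""
    return rng.multivariate_normal(data.mean(axis=0), np.cov(data, rowvar=False), size=n)

rng = np.random.default_rng(0)
n = 2000

# Hypothetical "original" cohort: systolic blood pressure depends on age and BMI.
age = rng.normal(50, 10, n)
bmi = rng.normal(27, 4, n)
sbp = 90 + 0.5 * age + 0.8 * bmi + rng.normal(0, 10, n)
original = np.column_stack([age, bmi, sbp])

# Analysis on the original data, with approximate 95% confidence intervals.
beta_orig, se_orig = ols(original[:, :2], original[:, 2])
ci_low = beta_orig - 1.96 * se_orig
ci_high = beta_orig + 1.96 * se_orig

# Same analysis repeated on multiple synthetic replications.
synthetic_betas = []
for _ in range(20):
    syn = gaussian_synthesizer(original, n, rng)
    b, _ = ols(syn[:, :2], syn[:, 2])
    synthetic_betas.append(b)
synthetic_betas = np.array(synthetic_betas)

# Summaries: mean synthetic estimate and share of replications inside the original CI.
mean_syn = synthetic_betas.mean(axis=0)
coverage = ((synthetic_betas >= ci_low) & (synthetic_betas <= ci_high)).mean(axis=0)

print("original estimates:      ", beta_orig)
print("mean synthetic estimates:", mean_syn)
print("share inside original CI:", coverage)
```

Repeating the synthesis several times, as the abstract describes, separates the variability introduced by the generator from a single lucky or unlucky draw. In practice the stand-in generator would be swapped for an actual ARF implementation.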
Problem

Research questions and friction points this paper is trying to address.

Evaluating synthetic data's ability to replicate epidemiological research findings
Assessing adversarial random forests for efficient epidemiological data synthesis
Testing synthetic data performance across diverse health study replications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial random forests for synthetic data
Validation by replicating published epidemiological studies
Dimensionality reduction enhances synthesis stability
Jan Kapar
PhD Student, University of Bremen
Generative modeling, Tabular data, Machine learning, Explainable artificial intelligence
Kathrin Günther
Leibniz Institute for Prevention Research and Epidemiology—BIPS, Bremen, Germany
Lori Ann Vallis
Department of Human Health and Nutritional Sciences, University of Guelph, Guelph, Ontario, Canada
Klaus Berger
Institute of Epidemiology and Social Medicine, University of Münster, Münster, Germany
Nadine Binder
Institute of General Practice/Family Medicine, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany; Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Freiburg, Germany
Hermann Brenner
German Cancer Research Center (DKFZ), Heidelberg, Germany
Stefanie Castell
Department of Epidemiology, Helmholtz Centre for Infection Research (HZI), Brunswick, Germany
Beate Fischer
Department of Epidemiology and Preventive Medicine, University of Regensburg, Regensburg, Germany
Volker Harth
Institute for Occupational and Maritime Medicine (ZfAM), University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Bernd Holleczek
Saarland Cancer Registry, Saarbrücken, Germany
Timm Intemann
Leibniz Institute for Prevention Research and Epidemiology—BIPS, Bremen, Germany
Till Ittermann
Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
André Karch
Institute of Epidemiology and Social Medicine, University of Münster, Münster, Germany
Thomas Keil
University of Zurich
Lilian Krist
Institute of Social Medicine, Epidemiology and Health Economics, Charité - Universitätsmedizin Berlin, Berlin, Germany
Berit Lange
Department of Epidemiology, Helmholtz Centre for Infection Research (HZI), Brunswick, Germany
Michael F. Leitzmann
Department of Epidemiology and Preventive Medicine, University of Regensburg, Regensburg, Germany
Katharina Nimptsch
Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Molecular Epidemiology Research Group, Berlin, Germany
Nadia Obi
Institute for Occupational and Maritime Medicine (ZfAM), University Medical Center Hamburg-Eppendorf, Hamburg, Germany
Iris Pigeot
Leibniz Institute for Prevention Research and Epidemiology—BIPS, Bremen, Germany; Faculty of Mathematics and Computer Science, University of Bremen, Bremen, Germany
Tobias Pischon
Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Molecular Epidemiology Research Group, Berlin, Germany; Max Delbrück Center for Molecular Medicine in the Helmholtz Association (MDC), Biobank Technology Platform, Berlin, Germany; Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
Tamara Schikowski
IUF – Leibniz Research Institute for Environmental Medicine, Düsseldorf, Germany; Department of Environment and Health, School of Public Health, University of Bielefeld, Bielefeld, Germany
Börge Schmidt
Institute for Medical Informatics, Biometry and Epidemiology, University Hospital of Essen, University of Duisburg-Essen, Germany
Carsten Oliver Schmidt
Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
Anja M. Sedlmair
Center for Translational Oncology, University Hospital Regensburg, Regensburg, Germany; Bavarian Cancer Research Center (BZKF), Regensburg, Germany; Department of Epidemiology and Preventive Medicine, University of Regensburg, Regensburg, Germany