SynthIPD: assumption-lean synthetic individual patient data generation

📅 2025-09-19
🤖 AI Summary
In clinical research, individual patient data (IPD) are essential for survival analysis but are often inaccessible due to privacy constraints, high sharing costs, and restricted access. To address this, we propose a three-step, IPD-free framework that synthesizes realistic IPD solely from published clinical trial reports. First, SVG-based Kaplan–Meier (KM) curve parsing extracts time-to-event survival probabilities and at-risk/censoring counts with high fidelity. Second, subgroup-level summary statistics are integrated to generate interpretable, clinically plausible covariate distributions. Third, synthetic IPD with authentic survival endpoint structures—including censoring patterns and hazard dynamics—are generated without relying on black-box models or strong parametric assumptions. Two case studies and simulation experiments demonstrate high fidelity: synthesized data accurately reproduce original KM curves, yield consistent Cox regression estimates, and faithfully capture subgroup treatment effects. This approach enhances generalizability and interpretability while providing a reliable, privacy-preserving surrogate for evidence synthesis.

📝 Abstract
Individual patient data (IPD) are essential for statistical inference in clinical research. However, privacy concerns, high data-sharing costs, and restrictive access often make IPD unavailable. Conventional synthetic data generation usually relies on black-box models such as generative adversarial networks. These methods, however, require a large amount of IPD for model training, may not generalize, and lack interpretability. This paper introduces an assumption-lean, three-step methodology for generating synthetic IPD with survival endpoints based only on published clinical trial articles. The method mainly leverages Kaplan-Meier (KM) curves with at-risk/censoring information and subgroup-level summary statistics. It digitizes the KM curve from Scalable Vector Graphics (SVG), achieving better-than-pixel accuracy, and then generates synthetic covariates based on the summary statistics. We illustrate the method's potential through two detailed case studies and simulation studies. The method enables high-fidelity IPD generation to support evidence-based medical decisions.
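To make the idea of recovering patient-level records from a KM curve concrete, here is a minimal sketch in Python. It is not the authors' algorithm; it only inverts the standard product-limit formula, d_i = n_i (1 - S_i / S_{i-1}), under the assumption that the drop times, the survival values after each drop, the initial number at risk, and the censoring times are all already known. The function name `km_to_ipd` and its arguments are hypothetical.

```python
def km_to_ipd(times, surv, n_start, censor_times=()):
    """Rebuild (time, event) records from KM step data (illustrative sketch).

    times        : increasing times at which the KM curve drops
    surv         : survival probability right after each drop
    n_start      : number at risk at time zero
    censor_times : assumed-known censoring times (e.g. read from tick marks)
    Returns the records and the number of patients still at risk at the end.
    """
    records = []
    n = n_start
    prev_s = 1.0
    censor = sorted(censor_times)
    ci = 0
    for t, s in zip(times, surv):
        # Patients censored before this drop leave the risk set first.
        while ci < len(censor) and censor[ci] < t:
            records.append((censor[ci], 0))
            n -= 1
            ci += 1
        # Invert the product-limit step: S_i = S_{i-1} * (1 - d_i / n_i).
        d = round(n * (1 - s / prev_s))
        records.extend([(t, 1)] * d)
        n -= d
        prev_s = s
    # Censoring marks after the last drop.
    while ci < len(censor):
        records.append((censor[ci], 0))
        n -= 1
        ci += 1
    return records, n


# Toy example: 10 patients, 2 events at t=1, one censoring at t=1.5, 1 event at t=2.
records, remaining = km_to_ipd([1, 2], [0.8, 0.8 * 6 / 7], 10, censor_times=[1.5])
```

In this toy example, the patients still at risk after the last drop would typically be treated as censored at the end of follow-up. Real published curves rarely report exact censoring times, so a practical method has to distribute censoring using the at-risk table instead.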
Problem

Research questions and friction points this paper is trying to address.

Generates synthetic patient data when individual clinical data is unavailable
Uses only published trial summaries like Kaplan-Meier curves and statistics
Provides interpretable alternative to black-box generative models for IPD
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic patient data from published articles
Digitizes Kaplan-Meier curves using Scalable Vector Graphics
Creates synthetic covariates from subgroup summary statistics
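
As a rough illustration of why a vector graphic can be read more precisely than a pixel image, the sketch below parses the `d` attribute of an SVG path and maps its coordinates onto the time and survival axes. It is an assumption-laden example, not the paper's parser: it handles only absolute `M`, `L`, `H`, and `V` commands, ignores transforms and curve commands, and takes the axis calibration (two known pixel/value pairs per axis) as given. The names `path_points` and `to_data` and all numbers are hypothetical.

```python
import re


def path_points(d):
    """Parse absolute M/L/H/V commands of an SVG path into (x, y) points."""
    pts = []
    x = y = 0.0
    for cmd, args in re.findall(r"([MLHV])([^MLHV]*)", d):
        nums = [float(v) for v in re.findall(r"-?\d+\.?\d*(?:e-?\d+)?", args)]
        if cmd in "ML":
            # Coordinates come in (x, y) pairs.
            for i in range(0, len(nums) - 1, 2):
                x, y = nums[i], nums[i + 1]
                pts.append((x, y))
        elif cmd == "H":
            # Horizontal segment: only x changes.
            for v in nums:
                x = v
                pts.append((x, y))
        else:
            # Vertical segment: only y changes.
            for v in nums:
                y = v
                pts.append((x, y))
    return pts


def to_data(pts, x_axis, y_axis):
    """Map SVG coordinates to data units by linear interpolation.

    Each axis is ((coord0, value0), (coord1, value1)), e.g. taken from
    two labeled tick marks.
    """
    def lin(p, axis):
        (p0, v0), (p1, v1) = axis
        return v0 + (p - p0) * (v1 - v0) / (p1 - p0)

    return [(lin(x, x_axis), lin(y, y_axis)) for x, y in pts]


# Hypothetical step curve: x=50 is month 0, x=250 is month 20;
# y=20 is survival 1.0, y=220 is survival 0.0.
curve = path_points("M 50 20 H 100 V 60 H 150 V 100")
km = to_data(curve, ((50, 0.0), (250, 20.0)), ((20, 1.0), (220, 0.0)))
```

Because the coordinates are read as stored numbers rather than estimated from rendered pixels, the recovered step heights are limited only by the precision written into the file.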
Zixuan Zhao
Department of Statistics, The George Washington University
Zexin Ren
Department of Statistics, The George Washington University
Guannan Zhai
Department of Statistics, The George Washington University
Feifang Hu
Department of Statistics, The George Washington University
Will Ma
Columbia University
En Xie
HopeAI, Inc.
Qian Shi
Mayo Clinic