SynthIPD: assumption-lean synthetic individual patient data generation

📅 2025-09-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In clinical research, individual patient data (IPD) are essential for survival analysis but are often inaccessible due to privacy constraints, high sharing costs, and restricted access. To address this, we propose a three-step, IPD-free framework that synthesizes realistic IPD solely from published clinical trial reports. First, SVG-based Kaplan–Meier (KM) curve parsing extracts time-to-event survival probabilities and at-risk/censoring counts with high fidelity. Second, subgroup-level summary statistics are integrated to generate interpretable, clinically plausible covariate distributions. Third, synthetic IPD with authentic survival endpoint structures—including censoring patterns and hazard dynamics—are generated without relying on black-box models or strong parametric assumptions. Two case studies and simulation experiments demonstrate high fidelity: synthesized data accurately reproduce original KM curves, yield consistent Cox regression estimates, and faithfully capture subgroup treatment effects. This approach enhances generalizability and interpretability while providing a reliable, privacy-preserving surrogate for evidence synthesis.

📝 Abstract

Individual patient data (IPD) are essential for statistical inference in clinical research. However, privacy concerns, high data-sharing costs, and restrictive access often make IPD unavailable. Conventional synthetic data generation usually relies on black-box models such as generative adversarial networks. These methods, however, require a large amount of IPD for model training, may not generalize, and lack interpretability. This paper introduces an assumption-lean, three-step methodology for generating synthetic IPD with survival endpoints based only on published clinical trial articles. The method mainly leverages Kaplan-Meier (KM) curves with at-risk/censoring information and subgroup-level summary statistics. It digitizes the KM curve from Scalable Vector Graphics (SVG), achieving beyond-pixel accuracy, and then generates synthetic covariates based on the summary statistics. We illustrate the method's potential through two detailed case studies and simulation studies. The method has important implications: it enables high-fidelity IPD generation to support evidence-based medical decisions.
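To make the idea of recovering patient-level records from a KM curve concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes the drop times, the survival probability just after each drop, the initial number at risk, and the individual censoring times are all known exactly. The names `reconstruct_ipd`, `km_estimate`, and `end_time` are hypothetical and chosen only for illustration.

```python
def reconstruct_ipd(times, surv, n_start, censor_times=(), end_time=None):
    """Rebuild (time, event) records from a KM step curve.

    times        : ascending times at which the curve drops
    surv         : survival probability just after each drop
    n_start      : number at risk at time zero
    censor_times : known censoring times (assumed input for this sketch)
    end_time     : time at which patients still at risk are censored
                   (assumption: defaults to the last drop time)
    """
    records = []
    n_at_risk = n_start
    prev_s = 1.0
    censored = sorted(censor_times)
    ci = 0
    for t, s in zip(times, surv):
        # Patients censored strictly before t are no longer at risk at t.
        while ci < len(censored) and censored[ci] < t:
            records.append((censored[ci], 0))
            n_at_risk -= 1
            ci += 1
        if prev_s <= 0 or n_at_risk <= 0:
            break
        # Invert the KM update S(t) = S(t-) * (1 - d / n) to get the event count d.
        d = round(n_at_risk * (1 - s / prev_s))
        records.extend([(t, 1)] * d)
        n_at_risk -= d
        prev_s = s
    # Censoring marks that fall after the last drop.
    while ci < len(censored):
        records.append((censored[ci], 0))
        n_at_risk -= 1
        ci += 1
    # Everyone still at risk is administratively censored at end_time.
    if end_time is None:
        end_time = times[-1] if times else 0.0
    records.extend([(end_time, 0)] * max(n_at_risk, 0))
    return records


def km_estimate(records):
    """Plain KM estimator; returns (time, survival) at each event time."""
    n = len(records)
    s = 1.0
    curve = []
    for t in sorted({u for u, _ in records}):
        d = sum(1 for u, e in records if u == t and e == 1)
        c = sum(1 for u, e in records if u == t and e == 0)
        if d:
            s *= 1 - d / n
            curve.append((t, s))
        n -= d + c
    return curve
```

Running `km_estimate` on the output of `reconstruct_ipd` gives a round-trip check: the re-estimated curve should match the input survival probabilities when the inputs are exact. With digitized curves the inputs carry rounding error, so the recovered event counts are only approximate.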

Problem

Research questions and friction points this paper is trying to address.

Generates synthetic patient data when individual clinical data is unavailable

Uses only published trial summaries like Kaplan-Meier curves and statistics

Provides interpretable alternative to black-box generative models for IPD

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic patient data from published articles

Digitizes Kaplan-Meier curves using Scalable Vector Graphics

Creates synthetic covariates from subgroup summary statistics
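Reading a KM curve from vector graphics rather than pixels can be sketched as follows. This is an illustrative Python sketch under strong assumptions, not the paper's implementation: the curve is a single SVG path that uses only absolute `M`, `L`, `H`, and `V` commands, and two reference ticks per axis are known for calibration. The names `parse_path`, `axis_map`, and `km_drops` are hypothetical.

```python
import re

NUM = re.compile(r"-?\d*\.?\d+(?:[eE][-+]?\d+)?")


def parse_path(d):
    """Return the vertices of an SVG path that uses only absolute M/L/H/V."""
    points = []
    x = y = 0.0
    for cmd, args in re.findall(r"([MLHV])([^MLHV]*)", d):
        nums = [float(v) for v in NUM.findall(args)]
        if cmd in "ML":
            # M and L take (x, y) pairs.
            for i in range(0, len(nums) - 1, 2):
                x, y = nums[i], nums[i + 1]
                points.append((x, y))
        elif cmd == "H":
            # H moves horizontally to an absolute x.
            for v in nums:
                x = v
                points.append((x, y))
        else:
            # V moves vertically to an absolute y.
            for v in nums:
                y = v
                points.append((x, y))
    return points


def axis_map(px0, v0, px1, v1):
    """Linear map from SVG coordinates to data values, from two reference ticks."""
    scale = (v1 - v0) / (px1 - px0)
    return lambda px: v0 + (px - px0) * scale


def km_drops(points, x_map, y_map):
    """Return (time, survival) just after each downward step of the curve."""
    drops = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 == x1 and y_map(y1) < y_map(y0):  # vertical segment going down
            drops.append((x_map(x1), y_map(y1)))
    return drops
```

For example, `km_drops(parse_path("M 50 20 H 100 V 60 H 150 V 100"), axis_map(50, 0.0, 250, 20.0), axis_map(20, 1.0, 220, 0.0))` yields one (time, survival) pair per step of the curve. Because the coordinates come straight from the path data, the precision is limited by the figure's export rather than by image resolution. Real figures may use relative commands, transforms, or curves split across several paths, which this sketch does not handle.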
