🤖 AI Summary
In clinical research, individual patient data (IPD) are essential for survival analysis but are often inaccessible due to privacy constraints, high sharing costs, and restricted access. To address this, we propose a three-step, IPD-free framework that synthesizes realistic IPD solely from published clinical trial reports. First, SVG-based Kaplan–Meier (KM) curve parsing extracts time-to-event survival probabilities and at-risk/censoring counts with high fidelity. Second, subgroup-level summary statistics are integrated to generate interpretable, clinically plausible covariate distributions. Third, synthetic IPD with authentic survival endpoint structures—including censoring patterns and hazard dynamics—are generated without relying on black-box models or strong parametric assumptions. Two case studies and simulation experiments demonstrate high fidelity: synthesized data accurately reproduce original KM curves, yield consistent Cox regression estimates, and faithfully capture subgroup treatment effects. This approach enhances generalizability and interpretability while providing a reliable, privacy-preserving surrogate for evidence synthesis.
📝 Abstract
Individual patient data (IPD) are essential for statistical inference in clinical research. However, privacy concerns, high data-sharing costs, and restricted access often make IPD unavailable. Conventional synthetic data generation usually relies on black-box models such as generative adversarial networks. These methods, however, require a large amount of IPD for model training, may not generalize, and lack interpretability. This paper introduces an assumption-lean, three-step methodology for generating synthetic IPD with survival endpoints based only on published clinical trial articles. The method mainly leverages Kaplan-Meier (KM) curves with at-risk/censoring information and subgroup-level summary statistics. It digitizes the KM curve from Scalable Vector Graphics (SVG) with beyond-pixel accuracy and then generates synthetic covariates based on the summary statistics. We illustrate the method's potential through two detailed case studies and simulation studies. The method enables high-fidelity IPD generation to support evidence-based medical decisions.
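The abstract does not spell out how digitized KM coordinates are turned back into patient-level records, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes the curve has already been digitized into exact drop times and post-drop survival values, that the number at risk at time zero is known, and that the number censored between consecutive drops is available. The function name `events_from_km` and its arguments are hypothetical. It inverts the KM product-limit relation S_i = S_{i-1} (1 - d_i / n_i) to recover the event count d_i at each drop.

```python
def events_from_km(times, surv, n_at_risk, censored=None):
    """Recover (time, number at risk, number of events) for each KM drop.

    times     : times at which the digitized curve drops
    surv      : survival probability immediately after each drop
    n_at_risk : number at risk at time zero
    censored  : number censored after each drop and before the next one
                (assumed known; defaults to no censoring)
    """
    if censored is None:
        censored = [0] * len(times)
    rows = []
    prev_s = 1.0
    n = n_at_risk
    for t, s, c in zip(times, surv, censored):
        # Invert S_i = S_{i-1} * (1 - d_i / n_i) and round to a whole count.
        d = round(n * (1.0 - s / prev_s))
        rows.append((t, n, d))
        # Patients who had the event or were censored leave the risk set.
        n -= d + c
        prev_s = s
    return rows


# Illustrative toy input: 10 patients, three drops, one censoring after the first drop.
rows = events_from_km([1.0, 2.0, 3.0], [0.9, 0.675, 0.5625], 10, censored=[1, 0, 0])
for t, n, d in rows:
    print(f"t={t}: at risk={n}, events={d}")
```

Each recovered row can then be expanded into `d` event records at time `t`, with the censored patients placed between drops. How the censoring times and the covariates are assigned is where the paper's second and third steps would come in.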