SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis

📅 2025-09-26
🤖 AI Summary
Generating synthetic data for survival analysis requires jointly modeling the event-time distribution and the censoring mechanism, a challenge that existing methods leave unaddressed. This paper introduces diffusion models to survival analysis for the first time, proposing an end-to-end generative framework that jointly models mixed-type covariates, event times, and right-censoring indicators. A survival-specific loss function directly optimizes downstream evaluation metrics, including the concordance index (C-index) and Brier score, bridging the gap between generative fidelity and predictive utility. In experiments on multiple real-world clinical datasets, the method outperforms state-of-the-art synthetic data generators in both distributional fidelity (e.g., Kolmogorov–Smirnov test statistics and Wasserstein distance) and downstream survival prediction performance. The approach is aimed at privacy-preserving modeling and data augmentation in clinical research.
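
For readers unfamiliar with the C-index mentioned above, the following is a minimal sketch of Harrell's concordance index for right-censored data. It is a textbook pairwise definition and not necessarily the exact variant used in the paper; the function name and argument names are illustrative assumptions, and ties in observed time are simply ignored.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (illustrative sketch, not the paper's code).

    times  : observed times (event or censoring time)
    events : 1 if the event was observed, 0 if right-censored
    risks  : predicted risk scores; higher means an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a comparable pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event has the higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy usage: risks perfectly ordered against observed times.
print(concordance_index([2, 4, 6, 8], [1, 0, 1, 1], [0.9, 0.5, 0.7, 0.1]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the metric is a natural check of whether a model trained on synthetic data still orders patients correctly.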

📝 Abstract
Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and downstream evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first diffusion model explicitly designed for generating synthetic survival data.
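
To make the notion of right-censoring concrete, here is a small simulation sketch. It assumes independent exponential event and censoring times, which is purely an illustrative choice and not a claim about the datasets or the mechanism used in the paper; `simulate_right_censored` and its parameters are hypothetical names.

```python
import random


def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Draw n right-censored records as (observed_time, event_indicator).

    Assumption for illustration only: event and censoring times are
    independent exponentials with the given rates.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        event_time = rng.expovariate(event_rate)    # latent time to event
        censor_time = rng.expovariate(censor_rate)  # latent dropout / end of follow-up
        observed = min(event_time, censor_time)     # only the earlier one is seen
        delta = 1 if event_time <= censor_time else 0  # 1 = event, 0 = censored
        records.append((observed, delta))
    return records


records = simulate_right_censored(5)
for observed, delta in records:
    print(f"time={observed:.2f}  event_observed={delta}")
```

Each record carries only the earlier of the two latent times plus an indicator, so a generator that ignores the indicator would conflate dropout times with true event times.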
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic survival data with realistic event-time distributions
Preserving censoring mechanisms in incomplete survival datasets
Jointly modeling covariates, event times, and right-censoring
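
One common way to inspect an event-time distribution in the presence of censoring is the Kaplan-Meier estimator, which could be computed on real and synthetic data and compared. The sketch below is a plain-Python version for illustration; this summary does not say whether the paper uses this estimator, and the function name is an assumption.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve as a list of (time, survival_probability).

    times  : observed times
    events : 1 if the event was observed, 0 if right-censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set.
        at_risk -= removed
    return curve


print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

The censored subject at time 2 does not produce a step in the curve but still shrinks the risk set, which is how censoring shapes the estimated distribution.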
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end diffusion model for survival data generation
Jointly generates covariates, event times, and censoring
Uses survival-tailored loss function for downstream optimization
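
As background on what a diffusion model does to a data record, the following sketch shows a generic DDPM-style Gaussian forward noising step applied to a numeric vector. It is not SurvDiff's actual procedure: the linear schedule, its default values, and all function names are assumptions chosen for illustration, and the handling of categorical covariates and the censoring indicator is not covered here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + k * step for k in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
record = [0.3, -1.2, 2.5]  # hypothetical standardized numeric features, e.g. covariates and log-time
print(forward_noise(record, 10, alpha_bars, rng))   # lightly noised
print(forward_noise(record, 999, alpha_bars, rng))  # close to pure Gaussian noise
```

A generative model of this family is trained to reverse such a noising process; a survival-tailored loss, as described above, would add terms on top of the usual denoising objective.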