Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities

📅 2026-04-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generic generative synthetic data often performs well on predictive tasks, yet it can fail to preserve causal estimands such as the average treatment effect (ATE). This work proposes a hybrid synthesis framework that decouples covariate generation from the treatment and outcome mechanisms: covariate synthesis is monitored with distance-to-closest-record diagnostics, and separately learned nuisance models are used to construct (W, A, Y) triplets. It also studies targeted synthetic augmentation to mitigate practical positivity violations. The study formalizes, through sensitivity and tradeoff results, why fully generative models can distort causal estimates, advocates a synthesis strategy that separates covariates from causal mechanisms, and develops a synthetic simulation engine for comparing causal estimators (OR, IPW, AIPW, and TMLE) in finite samples. Experiments show that the hybrid approach substantially improves ATE preservation over fully generative baselines and provides a practical diagnostic tool.
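To make the estimand concrete, the block below writes the ATE as a functional of the covariate law and the conditional treatment-effect contrast, then splits the synthetic-versus-real gap into two terms. The notation (ψ, τ, and tildes for synthetic-data quantities) is assumed here for illustration, and the identity is elementary algebra rather than the paper's own sensitivity result. It presumes the usual identification conditions (consistency, no unmeasured confounding, positivity), under which ψ(P) equals the ATE.

```latex
\[
\psi(P) = \mathbb{E}_{W \sim P_W}\bigl[\tau(W)\bigr],
\qquad
\tau(w) = \mathbb{E}_P[Y \mid A = 1, W = w] - \mathbb{E}_P[Y \mid A = 0, W = w].
\]
\[
\psi(\tilde P) - \psi(P)
= \underbrace{\int \bigl(\tilde\tau(w) - \tau(w)\bigr)\, d\tilde P_W(w)}_{\text{error in the treatment-effect contrast}}
+ \underbrace{\int \tau(w)\, d\bigl(\tilde P_W - P_W\bigr)(w)}_{\text{error in the covariate law}}.
\]
```

Read this way, a synthesizer can score well on prediction, where the outcome is largely explained by W, and still carry a sizeable first term if the difference between the two treatment arms is poorly reproduced.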

📝 Abstract

Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic-test-on-real performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record diagnostics to monitor covariate synthesis and separately learned nuisance models to construct (W, A, Y) triplets. We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps by improving conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of OR, IPW, AIPW, and TMLE under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.
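The hybrid recipe described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not the authors' implementation: the covariate generator is a stand-in (bootstrap plus Gaussian jitter) for whatever tabular synthesizer one would actually use, the nuisance models are a plain logistic propensity model and a linear outcome regression with treatment interactions, and every function name (`dcr`, `hybrid_synthesize`, `ate_or_aipw`) is hypothetical. The simulated "real" data have a true ATE of 2.0 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def design(W, A=None):
    """Intercept + covariates, plus treatment and treatment-by-covariate terms if A is given."""
    cols = [np.ones(len(W)), W]
    if A is not None:
        cols += [A[:, None], A[:, None] * W]
    return np.column_stack(cols)

def fit_logistic(X, a, steps=50):
    """Newton-Raphson logistic regression (X already contains an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, X.T @ (a - p))
    return beta

def dcr(W_syn, W_real):
    """Distance to closest record: min Euclidean distance from each synthetic row to a real row."""
    d2 = ((W_syn[:, None, :] - W_real[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))

def hybrid_synthesize(W_real, A_real, Y_real, n_syn, jitter=0.1):
    # Step 1: covariates only. ASSUMPTION: bootstrap + jitter stands in for a real synthesizer.
    idx = rng.integers(0, len(W_real), n_syn)
    W_syn = W_real[idx] + jitter * rng.standard_normal((n_syn, W_real.shape[1]))
    # Step 2: nuisance models (propensity and outcome regression) fitted on the real data.
    beta_g = fit_logistic(design(W_real), A_real)
    XQ = design(W_real, A_real)
    beta_q, *_ = np.linalg.lstsq(XQ, Y_real, rcond=None)
    sigma = np.std(Y_real - XQ @ beta_q)
    # Step 3: draw treatment and outcome for the synthetic covariates from those models.
    A_syn = rng.binomial(1, sigmoid(design(W_syn) @ beta_g)).astype(float)
    Y_syn = design(W_syn, A_syn) @ beta_q + sigma * rng.standard_normal(n_syn)
    return W_syn, A_syn, Y_syn

def ate_or_aipw(W, A, Y):
    """Plug-in outcome-regression (OR) and AIPW estimates of the ATE."""
    beta_g = fit_logistic(design(W), A)
    beta_q, *_ = np.linalg.lstsq(design(W, A), Y, rcond=None)
    g = np.clip(sigmoid(design(W) @ beta_g), 0.01, 0.99)  # truncate extreme propensities
    q1 = design(W, np.ones(len(W))) @ beta_q
    q0 = design(W, np.zeros(len(W))) @ beta_q
    qa = np.where(A == 1, q1, q0)
    or_est = np.mean(q1 - q0)
    aipw = np.mean(q1 - q0 + (A / g - (1 - A) / (1 - g)) * (Y - qa))
    return or_est, aipw

# Simulated "real" data: ATE = 2.0 because E[W_3] = 0.
n = 1000
W = rng.standard_normal((n, 3))
A = rng.binomial(1, sigmoid(0.5 * W[:, 0] - 0.5 * W[:, 1])).astype(float)
Y = 2.0 * A + W @ np.array([1.0, -1.0, 0.5]) + A * W[:, 2] + rng.standard_normal(n)

W_s, A_s, Y_s = hybrid_synthesize(W, A, Y, n_syn=1000)
dcr_vals = dcr(W_s, W)
real_or, real_aipw = ate_or_aipw(W, A, Y)
syn_or, syn_aipw = ate_or_aipw(W_s, A_s, Y_s)

print("median DCR:", np.median(dcr_vals))
print("real  OR / AIPW:", real_or, real_aipw)
print("synth OR / AIPW:", syn_or, syn_aipw)
```

Comparing the real and synthetic estimates gives a crude ATE-preservation check, while the DCR values flag synthetic covariates that sit unusually close to (or far from) real records. Swapping the stand-in in step 1 for a trained generator leaves steps 2 and 3 unchanged, which is the point of separating the covariate law from the treatment and outcome mechanisms.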

Problem

Research questions and friction points this paper is trying to address.

synthetic data

causal inference

average treatment effect

generative models

covariate distribution

Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid synthetic data

causal inference

average treatment effect