🤖 AI Summary
Existing tabular data generators commonly neglect causal relationships among variables, which can lead downstream models to learn spurious or unfair associations. This paper proposes TabSCM, an approach that explicitly incorporates causal structure, represented as a completed partially directed acyclic graph (CPDAG), into the generation of mixed-type tabular data. The method first orients the CPDAG into a DAG, fits marginal distributions for root nodes, and then learns structural equations in topological order. Conditional diffusion models are used for continuous child variables and gradient-boosted trees for categorical ones, and records are generated by ancestral sampling. The framework supports exact counterfactual queries, fairness auditing, and policy simulation. Across seven real-world datasets, it matches or surpasses state-of-the-art GAN, diffusion, and large language model-based generators in statistical fidelity, downstream utility, and privacy risk, while reducing rule-violation rates and generating up to 583× faster than diffusion-only models.
📝 Abstract
Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) produced by any causal structure discovery algorithm, TabSCM (i) orients edges to obtain a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns structural assignments in topological order. These assignments are modeled with conditional diffusion models for continuous child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets spanning healthcare, finance, housing, and environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583× faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
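The ancestral-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-variable DAG (`age`, `income`, `approved`), the mechanism functions, and `sample_record` are hypothetical names invented for this example. Simple closed-form samplers stand in for the KDE root marginals, the conditional diffusion models, and the gradient-boosted trees named in the abstract. The sketch only shows how a record is generated node by node in topological order, and how fixing one node's value (an intervention) propagates to its descendants.

```python
import math
import random
from graphlib import TopologicalSorter

# Hypothetical toy DAG (assumption): each node maps to the set of its parents.
PARENTS = {
    "age": set(),                     # root node
    "income": {"age"},                # continuous child
    "approved": {"age", "income"},    # categorical child
}

def sample_age(rng, row):
    # Stand-in for a fitted root marginal (the abstract uses KDE here).
    return rng.gauss(40.0, 10.0)

def sample_income(rng, row):
    # Stand-in for a conditional model of a continuous child given its parents.
    return 1000.0 * row["age"] + rng.gauss(0.0, 5000.0)

def sample_approved(rng, row):
    # Stand-in for a classifier over a categorical child given its parents.
    p_yes = 1.0 / (1.0 + math.exp(-(row["income"] - 40000.0) / 10000.0))
    return "yes" if rng.random() < p_yes else "no"

MECHANISMS = {
    "age": sample_age,
    "income": sample_income,
    "approved": sample_approved,
}

def sample_record(rng, interventions=None):
    """Draw one record by visiting nodes parents-first.

    Nodes listed in `interventions` are clamped to the given value
    instead of being sampled from their mechanism.
    """
    interventions = interventions or {}
    row = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(PARENTS).static_order():
        if node in interventions:
            row[node] = interventions[node]
        else:
            row[node] = MECHANISMS[node](rng, row)
    return row

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_record(rng))
    print(sample_record(rng, interventions={"income": 1_000_000.0}))
```

Clamping `income` leaves `age` untouched and changes only the downstream `approved` value, which is the behaviour one would expect from an intervention on a structural causal model. Note that this is an interventional query; answering a counterfactual for a specific observed record would additionally require recovering that record's noise terms, which this sketch does not attempt.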