🤖 AI Summary
Existing synthetic tabular data methods struggle to model the cross-table causal dependencies prevalent in real-world relational databases, which yields distorted synthetic data of limited practical use. This paper proposes a structural causal model (SCM)-based framework for relational tabular data generation, presented as the first of its kind. The framework explicitly represents inter-table dependencies as a causal graph and integrates probabilistic graphical models with conditional generative mechanisms to enable joint multi-table modeling, making the synthesis of complex causal structures extracted from real relational databases interpretable and controllable. Experiments demonstrate that the generated data significantly outperforms state-of-the-art baselines in structural fidelity, statistical consistency, and downstream task performance (including TabPFN training and evaluation), thereby substantially enhancing the realism and practical utility of synthetic relational data.
📝 Abstract
Synthetic tabular data generation has received increasing attention in recent years, particularly with the emergence of foundation models for tabular data. The breakthrough success of TabPFN (Hollmann et al., 2025), which leverages vast quantities of synthetic tabular datasets derived from structural causal models (SCMs), demonstrates the critical role synthetic data plays in developing powerful tabular foundation models. However, most real-world tabular data exists in relational formats spanning multiple interconnected tables, a structure that current generation methods do not adequately address. In this work, we extend the SCM-based approach by developing a novel framework that generates realistic synthetic relational tabular data, including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
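
To make the idea of a cross-table causal dependency concrete, the following is a minimal toy sketch and not the paper's framework. It assumes a hypothetical two-table schema (`customers` as the parent table, `orders` as the child table linked by a `customer_id` foreign key), linear structural equations with Gaussian noise, and arbitrarily chosen coefficients. The function name `generate_relational_dataset` and all column names are illustrative assumptions.

```python
import random


def generate_relational_dataset(n_parents=5, max_children=4, seed=0):
    """Toy two-table SCM (hypothetical schema, not the paper's method).

    Returns (customers, orders) as lists of dicts; each order references
    its parent row through the customer_id foreign key.
    """
    rng = random.Random(seed)
    customers, orders = [], []
    for cid in range(n_parents):
        # Within-table SCM: income is a root node, credit depends on income.
        income = rng.gauss(50.0, 10.0)
        credit = 0.8 * income + rng.gauss(0.0, 5.0)
        customers.append({"customer_id": cid, "income": income, "credit": credit})

        # Cross-table causal edge: the parent's income shifts the amount
        # of every child row attached to this customer.
        for _ in range(rng.randint(1, max_children)):
            amount = 0.1 * income + rng.gauss(0.0, 1.0)
            orders.append(
                {"order_id": len(orders), "customer_id": cid, "amount": amount}
            )
    return customers, orders


if __name__ == "__main__":
    customers, orders = generate_relational_dataset()
    print(len(customers), "customers,", len(orders), "orders")
```

In this sketch the relational structure comes from sampling a variable number of child rows per parent, and the causal structure comes from letting a parent column appear in the child's structural equation. A fuller generator would presumably sample the causal graph and the functional forms rather than fixing them by hand.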