Improving TabPFN's Synthetic Data Generation by Integrating Causal Structure

Date: 2026-03-10
AI Summary
This work addresses a limitation of TabPFN in synthetic tabular data generation: features are generated in the column order of the input, and when that order conflicts with the causal structure the model introduces spurious correlations and degrades the fidelity of causal effects. To mitigate this, the authors propose a causal-aware autoregressive generation method that incorporates a directed acyclic graph (DAG) or, under partial causal knowledge, its equivalence class (CPDAG) into TabPFN's conditional sampling process. With a full DAG, each variable is sampled given its causal parents. Evaluations on controlled benchmarks and six CSuite datasets show that DAG-aware conditioning improves structural fidelity, distributional alignment, and preservation of average treatment effects (ATE) across most settings, while the CPDAG-based strategy yields moderate gains that depend on the number of oriented edges. The result is more reliable synthetic data for downstream causal inference tasks.
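As a rough illustration of what "preservation of average treatment effects" can mean in practice, the sketch below compares a backdoor-adjusted ATE estimate on real and synthetic rows. It is an assumption-laden toy, not the paper's evaluation protocol: the estimator (stratification on a single discrete confounder), the column names `T`, `Y`, `Z`, and the helper names `stratified_ate` and `ate_gap` are all hypothetical.

```python
from collections import defaultdict

def stratified_ate(rows, treatment="T", outcome="Y", confounder="Z"):
    """Backdoor-adjusted ATE for a binary treatment and one discrete confounder.

    Hypothetical helper: averages the treated-minus-control outcome gap
    within each confounder stratum, weighted by stratum size.
    """
    strata = defaultdict(lambda: {0: [], 1: []})
    for r in rows:
        strata[r[confounder]][r[treatment]].append(r[outcome])
    n = len(rows)
    ate = 0.0
    for z, groups in strata.items():
        if not groups[0] or not groups[1]:
            raise ValueError(f"stratum {z!r} lacks a treated or control unit")
        diff = sum(groups[1]) / len(groups[1]) - sum(groups[0]) / len(groups[0])
        weight = (len(groups[0]) + len(groups[1])) / n
        ate += weight * diff
    return ate

def ate_gap(real_rows, synthetic_rows, **columns):
    """Absolute difference between the ATE estimated on real and synthetic data."""
    return abs(stratified_ate(real_rows, **columns)
               - stratified_ate(synthetic_rows, **columns))

# Toy data (made up for illustration): the real effect of T on Y is 2,
# while the "synthetic" rows only reproduce an effect of 1.
real = [{"Z": z, "T": t, "Y": 2 * t + z}
        for z in (0, 1) for t in (0, 1) for _ in range(2)]
synthetic = [{"Z": z, "T": t, "Y": 1 * t + z}
             for z in (0, 1) for t in (0, 1) for _ in range(2)]

gap = ate_gap(real, synthetic)
```

A smaller gap means the synthetic data better preserves the causal effect. In a real study the adjustment set would come from the causal graph rather than being fixed to one column.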

๐Ÿ“ Abstract
Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high-quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN's generation process through two complementary approaches: Directed Acyclic Graph (DAG)-aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)-based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG-aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG-based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.
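The idea of DAG-aware conditioning described above (sample each variable given its causal parents) can be sketched as ancestral sampling over a topological order. This is a minimal sketch under stated assumptions: `toy_conditional` is a hypothetical stand-in for a conditional sampler such as TabPFN's, not its actual API, and the three-variable graph and the names `causal_order` and `sample_row` are invented for illustration.

```python
from graphlib import TopologicalSorter
import random

def causal_order(parents):
    """Return a variable order in which every node comes after its parents.

    `parents` maps each variable to the list of its causal parents.
    """
    return list(TopologicalSorter(parents).static_order())

def sample_row(parents, sample_conditional, rng):
    """Generate one synthetic row by sampling each variable given its parents only."""
    row = {}
    for var in causal_order(parents):
        # Condition on the causal parents, not on whatever columns came first.
        context = {p: row[p] for p in parents[var]}
        row[var] = sample_conditional(var, context, rng)
    return row

def toy_conditional(var, context, rng):
    """Hypothetical stand-in for a learned conditional sampler:
    sum of the parent values plus standard Gaussian noise."""
    return sum(context.values()) + rng.gauss(0.0, 1.0)

# Assumed example graph: Z confounds treatment T and outcome Y.
parents = {"Z": [], "T": ["Z"], "Y": ["Z", "T"]}

rng = random.Random(0)
order = causal_order(parents)
rows = [sample_row(parents, toy_conditional, rng) for _ in range(5)]
```

By contrast, a sampler that follows the column order of the input might generate `Y` before `Z` and then condition `Z` on `Y`, which is the kind of order-versus-structure conflict the abstract describes.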
Problem

Research questions and friction points this paper is trying to address.

synthetic tabular data
causal structure
spurious correlations
autoregressive generation
causal effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal structure
synthetic tabular data
TabPFN
DAG-aware conditioning
CPDAG
Davide Tugnoli
Department of Mathematics, Informatics and Geosciences, University of Trieste, Trieste, Italy
Andrea De Lorenzo
Department of Engineering and Architecture, University of Trieste, Trieste, Italy
Marco Virgolin
InSilicoTrials Technologies BV, The Netherlands
Giovanni Cinà
Amsterdam University Medical Center | University of Amsterdam
Medical AI, Machine Learning, Mathematical Logic