🤖 AI Summary
This paper addresses privacy preservation in cross-sectional tabular data synthesis with a divide-and-conquer generative framework. The original dataset is partitioned into disjoint subsets, each modeled independently by a separate generative model instance, and the synthetic outputs are then combined by a post hoc joining operation that requires no shared variables or identifiers. The framework allows different generative model types to be mixed and substantially increases privacy at only a low cost in utility. Case studies and examples on tabular data illustrate the available design choices and indicate that the approach also makes certain model types more effective and more feasible to use.
📝 Abstract
We propose a new framework for generating cross-sectional synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables or identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the available design choices. The principal benefit of disjoint generative models is significantly increased privacy at only a low utility cost. Additional findings include increased effectiveness and feasibility for certain model types and the possibility of mixed-model synthesis.
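To make the partition, model, and join steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: it assumes the partition is over columns, it uses bootstrap resampling as a stand-in for a real generative model, and it joins the parts by random positional pairing, which may differ from the joining operation the paper actually uses. All function names (`partition_columns`, `fit_and_sample`, `join_without_keys`, `disjoint_synthesize`) and the toy data are hypothetical.

```python
import random


def partition_columns(columns, n_parts, rng):
    """Split column names into n_parts disjoint groups (assumes n_parts <= len(columns))."""
    shuffled = list(columns)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def fit_and_sample(rows, cols, n_samples, rng):
    """Stand-in for a generative model: resample observed sub-rows with replacement.

    A real pipeline would train a separate generative model on these columns only.
    """
    projected = [{c: row[c] for c in cols} for row in rows]
    return [dict(rng.choice(projected)) for _ in range(n_samples)]


def join_without_keys(parts, rng):
    """Naive key-free join (an assumption): shuffle each part, then pair rows by position."""
    joined = None
    for part in parts:
        part = list(part)
        rng.shuffle(part)
        if joined is None:
            joined = [dict(r) for r in part]
        else:
            for target, extra in zip(joined, part):
                target.update(extra)
    return joined


def disjoint_synthesize(rows, n_parts=2, n_samples=None, seed=0):
    """Partition columns, sample each partition independently, then join the samples."""
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    n_samples = n_samples or len(rows)
    groups = partition_columns(columns, n_parts, rng)
    parts = [fit_and_sample(rows, g, n_samples, rng) for g in groups]
    return groups, join_without_keys(parts, rng)


# Toy data, invented purely for illustration.
real = [
    {"age": 34, "income": 52000, "city": "Aarhus", "smoker": False},
    {"age": 51, "income": 61000, "city": "Odense", "smoker": True},
    {"age": 27, "income": 38000, "city": "Aarhus", "smoker": False},
]

groups, synthetic = disjoint_synthesize(real, n_parts=2, n_samples=5, seed=1)
```

Because no single stand-in model ever sees a complete record, each synthetic row mixes sub-rows drawn independently per partition. The random pairing used here discards any dependence between columns in different partitions, so a practical joining operation would need to do better than this to keep the utility cost low.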