🤖 AI Summary
This paper addresses the longstanding trade-off between privacy preservation and data utility in differentially private synthetic data generation (DP-SDG). We propose, for the first time, a **vertical public-private attribute partitioning paradigm**: leveraging naturally occurring public attributes—such as geographic region or age group—as conditioning variables to guide differentially private modeling of private attributes. Unlike mainstream horizontal partitioning approaches—which rely on limited publicly available samples—our framework integrates conditional generative modeling, statistical moment constraints, and principles from vertical federated learning. It guarantees strict ε=1.0 differential privacy while enhancing structural fidelity. Evaluated on multiple benchmark datasets, our method reduces Kolmogorov–Smirnov (KS) distance by 23% and column-wise correlation error by 18% over state-of-the-art baselines. This work establishes a novel pathway toward provably private synthetic data generation with high statistical and structural fidelity.
📝 Abstract
Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.