Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation

📅 2025-04-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the longstanding trade-off between privacy preservation and data utility in differentially private synthetic data generation (DP-SDG). We propose, for the first time, a **vertical public-private attribute partitioning paradigm**: leveraging naturally occurring public attributes—such as geographic region or age group—as conditioning variables to guide differentially private modeling of private attributes. Unlike mainstream horizontal partitioning approaches—which rely on limited publicly available samples—our framework integrates conditional generative modeling, statistical moment constraints, and principles from vertical federated learning. It guarantees strict ε=1.0 differential privacy while enhancing structural fidelity. Evaluated on multiple benchmark datasets, our method reduces Kolmogorov–Smirnov (KS) distance by 23% and column-wise correlation error by 18% over state-of-the-art baselines. This work establishes a novel pathway toward provably private synthetic data generation with high statistical and structural fidelity.

Technology Category

Application Category

📝 Abstract
Differentially Private Synthetic Data Generation (DP-SDG) is a key enabler of private and secure tabular-data sharing, producing artificial data that carries through the underlying statistical properties of the input data. This typically involves adding carefully calibrated statistical noise to guarantee individual privacy, at the cost of synthetic data quality. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. These methods study a horizontal public-private partitioning which assumes access to a small number of public rows that can be used for model initialization, providing a small utility gain. However, realistic datasets often naturally consist of public and private attributes, making a vertical public-private partitioning relevant for practical synthetic data deployments. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting. We compare this framework against our alternative approach that uses conditional generation, highlighting initial limitations of public-data assisted methods and proposing future research directions to address these challenges.
Problem

Research questions and friction points this paper is trying to address.

Improving synthetic data quality in DP-SDG
Addressing vertical public-private data partitioning
Exploring public-data assisted generation limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vertical public-private split for synthetic data
Adapts horizontal methods to vertical setting
Uses conditional generation for improvement
🔎 Similar Papers
S
Samuel Maddock
Meta Platforms, Inc.
S
Shripad Gade
Meta Platforms, Inc.
Graham Cormode
Graham Cormode
Meta AI, University of Warwick
AlgorithmsData AnalysisHapax LegomenonPrivacy
W
Will Bullock
Meta Platforms, Inc.