🤖 AI Summary
Existing differentially private synthetic tabular data methods (e.g., AIM) suffer from memory explosion, computational inefficiency, and the need for full retraining whenever the graph structure changes in high-dimensional settings; generative approaches like GEM improve scalability, but their empirical validation has been limited to small-scale datasets. This paper proposes GEM+, a framework that integrates AIM's adaptive measurement mechanism with GEM's generative neural network, enabling end-to-end, differentially private synthesis of high-dimensional tabular data with over one hundred columns. By adaptively selecting low-order marginals, measuring them with calibrated noise, and training the generative network on those noisy measurements, the framework drastically reduces memory footprint and training overhead. Extensive experiments on multiple benchmark datasets demonstrate superior data utility and computational efficiency over AIM and other baselines; notably, GEM+ successfully synthesizes large-scale, high-dimensional datasets on which AIM fails due to resource constraints.
📝 Abstract
State-of-the-art differentially private synthetic tabular data has been defined by adaptive 'select-measure-generate' frameworks, exemplified by methods like AIM. These approaches iteratively measure low-order noisy marginals and fit graphical models to produce synthetic data, enabling systematic optimisation of data quality under privacy constraints. Graphical models, however, are inefficient for high-dimensional data because they require substantial memory and must be retrained from scratch whenever the graph structure changes, leading to significant computational overhead. Recent methods, like GEM, overcome these limitations by using generator neural networks for improved scalability. However, empirical comparisons have mostly focused on small datasets, limiting real-world applicability. In this work, we introduce GEM+, which integrates AIM's adaptive measurement framework with GEM's scalable generator network. Our experiments show that GEM+ outperforms AIM in both utility and scalability, delivering state-of-the-art results while efficiently handling datasets with over a hundred columns, where AIM fails due to memory and computational overheads.
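The 'select-measure-generate' loop the abstract describes can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's algorithm: the dataset, noise scale `sigma`, and the crude resampling "generate" step are placeholders. A real AIM/GEM+ pipeline would select marginals privately (e.g., via the exponential mechanism), calibrate the Gaussian noise to a privacy budget, and fit a graphical model (AIM) or train a generator network (GEM) instead of resampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical table: 3 columns, each with a domain of size 3 (hypothetical).
data = rng.integers(0, 3, size=(1000, 3))
domains = [3, 3, 3]

def marginal(table, cols, domains):
    """Flattened, normalised histogram over a subset of columns (a low-order marginal)."""
    sizes = [domains[c] for c in cols]
    idx = np.ravel_multi_index([table[:, c] for c in cols], sizes)
    hist = np.bincount(idx, minlength=int(np.prod(sizes))).astype(float)
    return hist / hist.sum()

# Candidate low-order marginals: all column pairs.
candidates = [(i, j) for i in range(3) for j in range(i + 1, 3)]

# Start from a uniform synthetic table.
synth = rng.integers(0, 3, size=(1000, 3))

sigma = 0.01  # placeholder noise scale; NOT calibrated to any (epsilon, delta)
for _ in range(5):
    # SELECT: pick the marginal where the synthetic data errs most.
    # (In AIM this selection itself consumes privacy budget via the
    # exponential mechanism; here it is done in the clear for brevity.)
    errors = [np.abs(marginal(data, q, domains) - marginal(synth, q, domains)).sum()
              for q in candidates]
    q = candidates[int(np.argmax(errors))]

    # MEASURE: noisy answer to the selected marginal (Gaussian mechanism).
    sizes = tuple(domains[c] for c in q)
    n = int(np.prod(sizes))
    noisy = marginal(data, q, domains) + rng.normal(0, sigma, size=n)

    # GENERATE: crudely resample the selected columns to match the noisy
    # marginal; AIM fits a graphical model here, GEM trains a generator net.
    probs = np.clip(noisy, 1e-12, None)
    probs /= probs.sum()
    idx = rng.choice(n, size=len(synth), p=probs)
    synth[:, list(q)] = np.stack(np.unravel_index(idx, sizes), axis=1)
```

The key scalability point from the abstract shows up in the last step: a graphical model must be refit from scratch when the set of measured marginals (the graph structure) changes, whereas a generator network can simply continue training on the enlarged set of noisy measurements.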