🤖 AI Summary
Balancing utility and privacy in tabular data synthesis, particularly against membership inference attacks, remains a critical challenge. To address this, we propose DP-TLDM, the first differentially private latent diffusion model for tabular data. DP-TLDM pairs an autoencoder with a latent diffusion architecture and is grounded in the *f*-differential privacy (*f*-DP) framework, combining DP-SGD with batch-wise gradient clipping and using the separation value as a privacy metric to better capture the privacy gain from DP training. We systematically evaluate utility-privacy trade-offs across five state-of-the-art synthesizers under eight distinct adversarial attacks. At comparable privacy levels, DP-TLDM improves data resemblance by 35%, downstream task utility by 15%, and data discriminability by 50% over existing DP tabular generators, strengthening both the interpretability of the privacy guarantee and the practical applicability of synthetic tabular data.
📝 Abstract
Synthetic data from generative models has emerged as a privacy-preserving data-sharing solution: such a synthetic dataset should resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Prior work sheds light on the utility-privacy tradeoff for tabular data, revealing and quantifying the privacy risks of synthetic data. We first conduct an exhaustive empirical analysis of the utility-privacy tradeoff of five state-of-the-art tabular synthesizers against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation that tabular diffusion achieves high data quality but also incurs high privacy risk, we propose DP-TLDM, a Differentially Private Tabular Latent Diffusion Model, composed of an autoencoder network that encodes the tabular data and a latent diffusion model that synthesizes the latent tables. Following the emerging f-DP framework, we apply DP-SGD with batch clipping to train the autoencoder and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM achieves a meaningful theoretical privacy guarantee while significantly enhancing the utility of the synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves synthetic quality by an average of 35% in data resemblance, 15% in downstream task utility, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
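To make the training mechanism concrete, the sketch below illustrates one DP-SGD step with batch-wise gradient clipping as the abstract describes it: the aggregated batch gradient (rather than each per-sample gradient) is clipped to a norm bound and perturbed with Gaussian noise before the update. This is a minimal illustration under our own simplifying assumptions; the function names, parameters, and the plain-NumPy setting are ours, not the authors' implementation.

```python
import numpy as np

def dp_sgd_step(params, grad_fn, batch, clip_norm=1.0, noise_mult=1.0,
                lr=0.1, rng=None):
    """One DP-SGD update with batch clipping (illustrative sketch).

    grad_fn(params, batch) must return the aggregated batch gradient.
    noise_mult scales the Gaussian noise relative to clip_norm.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    g = grad_fn(params, batch)                        # aggregated batch gradient
    norm = np.linalg.norm(g)
    g = g * min(1.0, clip_norm / (norm + 1e-12))      # clip the whole batch gradient
    g = g + rng.normal(0.0, noise_mult * clip_norm,   # add calibrated Gaussian noise
                       size=g.shape)
    return params - lr * g                            # standard SGD update

# Toy usage: gradient of a quadratic loss toward the batch mean.
def grad_fn(p, b):
    return 2.0 * (p - b.mean(axis=0))

p0 = np.array([10.0, -10.0])
batch = np.zeros((4, 2))
p1 = dp_sgd_step(p0, grad_fn, batch, clip_norm=1.0, noise_mult=0.0, lr=0.1)
```

With the noise turned off, clipping alone bounds the step size by `lr * clip_norm`, which is the property that makes the per-step privacy cost analyzable in the f-DP accounting the paper relies on.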