🤖 AI Summary
Balancing utility and privacy in tabular data synthesis, particularly against membership inference attacks, remains a critical challenge. To address this, we propose DP-TLDM, the first differentially private latent diffusion model for tabular data. DP-TLDM pairs an autoencoder with a latent diffusion architecture and is grounded in the *f*-differential privacy (*f*-DP) framework, combining DP-SGD with batch-wise gradient clipping and using the separation value as a privacy metric to better capture the privacy gain from DP training. We systematically evaluate utility-privacy trade-offs across five state-of-the-art synthesizers under eight distinct adversarial attacks. At comparable privacy levels, DP-TLDM improves data resemblance by 35%, downstream task utility by 15%, and data discriminability by 50% over existing DP tabular generators, strengthening both the interpretability of the privacy guarantee and the practical applicability of synthetic tabular data.
📝 Abstract
Synthetic data from generative models has emerged as a privacy-preserving data-sharing solution: such a synthetic dataset should resemble the original data without revealing identifiable private information. The backbone technology of tabular synthesizers is rooted in image generative models, ranging from Generative Adversarial Networks (GANs) to recent diffusion models. Prior work sheds light on the utility-privacy tradeoff for tabular data, revealing and quantifying the privacy risks of synthetic data. We first conduct an exhaustive empirical analysis of the utility-privacy tradeoff of five state-of-the-art tabular synthesizers against eight privacy attacks, with a special focus on membership inference attacks. Motivated by the observation that tabular diffusion achieves high data quality but also incurs high privacy risk, we propose DP-TLDM, a Differentially Private Tabular Latent Diffusion Model, composed of an autoencoder network that encodes the tabular data and a latent diffusion model that synthesizes the latent tables. Following the emerging f-DP framework, we apply DP-SGD with batch clipping to train the autoencoder and use the separation value as the privacy metric to better capture the privacy gain from DP algorithms. Our empirical evaluation demonstrates that DP-TLDM achieves a meaningful theoretical privacy guarantee while significantly enhancing the utility of the synthetic data. Specifically, compared to other DP-protected tabular generative models, DP-TLDM improves synthetic quality by an average of 35% in data resemblance, 15% in downstream task utility, and 50% in data discriminability, all while preserving a comparable level of privacy risk.
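To make the training mechanism concrete, the sketch below illustrates one DP-SGD step with batch-wise gradient clipping as the abstract describes it: the aggregated batch gradient (rather than each per-sample gradient) is clipped to a norm bound and perturbed with Gaussian noise before the update. This is a minimal illustration under our own simplifying assumptions; the function names, parameters, and the plain-NumPy setting are ours, not the authors' implementation.

```python
import numpy as np

def dp_sgd_step(params, grad_fn, batch, clip_norm=1.0, noise_mult=1.0,
                lr=0.1, rng=None):
    """One DP-SGD update with batch clipping (illustrative sketch).

    grad_fn(params, batch) must return the aggregated batch gradient.
    noise_mult scales the Gaussian noise relative to clip_norm.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    g = grad_fn(params, batch)                        # aggregated batch gradient
    norm = np.linalg.norm(g)
    g = g * min(1.0, clip_norm / (norm + 1e-12))      # clip the whole batch gradient
    g = g + rng.normal(0.0, noise_mult * clip_norm,   # add calibrated Gaussian noise
                       size=g.shape)
    return params - lr * g                            # standard SGD update

# Toy usage: gradient of a quadratic loss toward the batch mean.
def grad_fn(p, b):
    return 2.0 * (p - b.mean(axis=0))

p0 = np.array([10.0, -10.0])
batch = np.zeros((4, 2))
p1 = dp_sgd_step(p0, grad_fn, batch, clip_norm=1.0, noise_mult=0.0, lr=0.1)
```

With the noise turned off, clipping alone bounds the step size by `lr * clip_norm`, which is the property that makes the per-step privacy cost analyzable in the f-DP accounting the paper relies on.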