🤖 AI Summary
To address the privacy risk that deep learning models inadvertently memorize and leak raw training samples in sensitive-data settings, this paper proposes a differential privacy (DP)-driven synthetic embedding generation method. The approach fits a Gaussian Mixture Model (GMM) in an embedding space using DP clustering, generating high-fidelity synthetic data with provable privacy guarantees. Notably, the authors incorporate DP clustering theory into the GMM fitting process, and the design allows the encoder and decoder modules to be freely substituted, preserving generality and scalability. Experiments on standard benchmarks demonstrate that training lightweight two-layer networks on the synthetic embeddings achieves state-of-the-art (SOTA) classification accuracy. Moreover, images synthesized via this method attain downstream task performance comparable to current best approaches.
📝 Abstract
Deep neural networks typically require large, high-quality datasets to achieve high performance on many machine learning tasks. When training involves potentially sensitive data, this process raises privacy concerns, as large models have been shown to unintentionally memorize and reveal sensitive information, including reconstructing entire training samples. Differential privacy (DP) provides a robust framework for protecting individual data. In particular, a recent approach to privately training deep neural networks is to approximate the input dataset with a privately generated synthetic dataset before running any subsequent training algorithm. We introduce a novel, principled method for DP synthetic image embedding generation, based on fitting a Gaussian Mixture Model (GMM) in an appropriate embedding space using DP clustering. Our method provably learns a GMM under separation conditions. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy on standard benchmark datasets. Additionally, we demonstrate that our method can generate realistic synthetic images that achieve downstream classification accuracy comparable to SOTA methods. Our method is quite general, as the encoder and decoder modules can be freely substituted to suit different tasks. It is also highly scalable, consisting only of subroutines that scale linearly with the number of samples and/or can be implemented efficiently in distributed systems.
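To make the pipeline concrete, here is a minimal sketch of the general idea: estimate a per-class Gaussian over embeddings with noise added to clipped sufficient statistics (a standard Gaussian-mechanism recipe), then sample synthetic labeled embeddings. All function names, the noise calibration, and the diagonal-covariance simplification are illustrative assumptions, not the paper's exact algorithm or privacy accounting.

```python
# Illustrative sketch only: per-class Gaussian fitting in embedding space
# with Gaussian-mechanism-style noise on clipped statistics. The clipping
# bound, sigma, and diagonal covariance are assumptions for demonstration.
import numpy as np

def dp_gaussian_per_class(embeddings, labels, clip=1.0, sigma=0.1, rng=None):
    """Estimate one Gaussian (mean, diagonal variance) per class from
    L2-clipped embeddings, with additive noise on count, sum, and
    sum of squares."""
    rng = np.random.default_rng(rng)
    params = {}
    for c in np.unique(labels):
        X = embeddings[labels == c]
        # Clip each embedding to L2 norm <= clip so its contribution
        # to the summed statistics is bounded.
        scale = np.maximum(np.linalg.norm(X, axis=1, keepdims=True) / clip, 1.0)
        Xc = X / scale
        # Noisy count and noisy sum give a private mean estimate.
        n_priv = max(len(Xc) + rng.normal(0, sigma), 1.0)
        mean = (Xc.sum(axis=0) + rng.normal(0, sigma * clip, Xc.shape[1])) / n_priv
        # Noisy second moment gives a private diagonal variance estimate.
        sq = ((Xc ** 2).sum(axis=0)
              + rng.normal(0, sigma * clip ** 2, Xc.shape[1])) / n_priv
        var = np.maximum(sq - mean ** 2, 1e-6)
        params[int(c)] = (mean, var)
    return params

def sample_synthetic(params, n_per_class, rng=None):
    """Sample labeled synthetic embeddings from the fitted Gaussians."""
    rng = np.random.default_rng(rng)
    Xs, ys = [], []
    for c, (mean, var) in params.items():
        Xs.append(rng.normal(mean, np.sqrt(var), size=(n_per_class, len(mean))))
        ys.append(np.full(n_per_class, c))
    return np.vstack(Xs), np.concatenate(ys)
```

Downstream, a small classifier (e.g. the two-layer network mentioned in the abstract) would be trained only on the output of `sample_synthetic`, so the training step never touches the real embeddings; in the paper the mixture is fit via DP clustering rather than the simple per-class statistics shown here.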