🤖 AI Summary
This paper investigates whether privacy-preserving data synthesis (PPDS) can safely replace real data for training classifiers, focusing on the fundamental utility-privacy trade-off. To this end, it proposes the first end-to-end evaluation framework, spanning generation, sampling, and classification, that systematically benchmarks state-of-the-art generative models (e.g., GANs and diffusion models) and uniformly quantifies privacy risk via model-agnostic membership inference attacks (MIAs) across diverse benchmark scenarios. The key contributions are threefold: (1) empirically uncovering intrinsic utility-privacy trade-off patterns across generative architectures; (2) rigorously assessing the limits of common privacy-mitigation strategies (e.g., differential privacy and output perturbation); and (3) delivering actionable, scenario-specific guidelines for data publishers, identifying which synthetic-data configurations allow real training data to be substituted safely while preserving utility under defined privacy constraints.
📝 Abstract
Advances in generative models have transformed synthetic image generation for privacy-preserving data synthesis (PPDS). However, the field lacks a comprehensive survey and comparison of synthetic image generation methods across diverse settings. In particular, when synthetic images are generated for the purpose of training a classifier, a generation-sampling-classification pipeline takes private training data as input and outputs the final classifier of interest. In this survey, we systematically categorize existing image synthesis methods, privacy attacks, and mitigations along this generation-sampling-classification pipeline. To empirically compare diverse synthesis approaches, we provide a benchmark of representative generative methods and use model-agnostic membership inference attacks (MIAs) as a measure of privacy risk. Through this study, we seek to answer critical questions in PPDS: Can synthetic data effectively replace real data? Which release strategy balances utility and privacy? Do mitigations improve the utility-privacy tradeoff? Which generative models perform best across different scenarios? With a systematic evaluation of diverse methods, our study provides actionable insights into the utility-privacy tradeoffs of synthetic data generation and guides decisions on optimal data release strategies for real-world applications.
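To make the privacy metric concrete, below is a minimal sketch of one common model-agnostic MIA: a confidence-thresholding attack that guesses an example was a training member when the released model is highly confident on it. The function names, the threshold value, and the toy confidence distributions are illustrative assumptions, not details from the paper; the paper's benchmark may use different attack variants.

```python
import numpy as np

def mia_confidence_attack(confidences, threshold=0.9):
    """Guess membership from model outputs alone (model-agnostic):
    examples on which the model is highly confident are predicted
    to be training-set members.

    confidences: array of max softmax probabilities, one per example.
    Returns a boolean array of membership guesses.
    """
    return np.asarray(confidences) >= threshold

def mia_advantage(member_conf, nonmember_conf, threshold=0.9):
    """Attack advantage = TPR - FPR at the chosen threshold.
    0 means the attack is no better than chance; 1 is a perfect attack.
    """
    tpr = np.mean(mia_confidence_attack(member_conf, threshold))
    fpr = np.mean(mia_confidence_attack(nonmember_conf, threshold))
    return tpr - fpr

# Toy data: members typically receive higher confidence than non-members,
# which is the signal this attack exploits.
rng = np.random.default_rng(0)
members = np.clip(rng.normal(0.95, 0.03, 1000), 0.0, 1.0)
nonmembers = np.clip(rng.normal(0.80, 0.10, 1000), 0.0, 1.0)
adv = mia_advantage(members, nonmembers)
```

In a benchmark like the one described, a lower attack advantage on a classifier trained with synthetic data (versus real data) indicates reduced privacy risk, while classification accuracy captures the utility side of the trade-off.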