🤖 AI Summary
Addressing the core challenges in privacy-preserving data publishing—namely, the trade-off between privacy and utility, heightened vulnerability to outlier leakage, and utility degradation caused by differential privacy (DP) noise—this paper proposes SMOTE-DP, a novel DP synthetic data generation framework. SMOTE-DP is the first to integrate SMOTE’s pattern-shrinking mechanism into the DP synthesis pipeline: it leverages oversampling to guide distributional contraction, jointly injects calibrated DP noise, and incorporates generative model fine-tuning with rigorous privacy budget accounting. Theoretically, the framework guarantees ε-differential privacy with ε ≤ 2 while significantly enhancing privacy robustness against adversarial inference. Empirically, on multiple benchmark datasets, downstream classification and regression tasks achieve an average accuracy improvement of 12.7% over state-of-the-art DP synthetic methods, alongside a 3.8× reduction in measured privacy leakage risk.
📝 Abstract
Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, however, not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.