SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Addressing the core challenges in privacy-preserving data publishing—namely, the trade-off between privacy and utility, heightened vulnerability to outlier leakage, and utility degradation caused by differential privacy (DP) noise—this paper proposes SMOTE-DP, a novel DP synthetic data generation framework. SMOTE-DP is the first to integrate SMOTE’s pattern-shrinking mechanism into the DP synthesis pipeline: it leverages oversampling to guide distributional contraction, jointly injects calibrated DP noise, and incorporates generative model fine-tuning with rigorous privacy budget accounting. Theoretically, the framework guarantees ε-differential privacy with ε ≤ 2 while significantly enhancing privacy robustness against adversarial inference. Empirically, on multiple benchmark datasets, downstream classification and regression tasks achieve an average accuracy improvement of 12.7% over state-of-the-art DP synthetic methods, alongside a 3.8× reduction in measured privacy leakage risk.

Technology Category

Application Category

📝 Abstract
Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, however, not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.
Problem

Research questions and friction points this paper is trying to address.

Balancing privacy-utility tradeoff in synthetic data publication
Preventing outlier information leakage in generative models
Reducing utility loss in differentially private data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SMOTE-DP combines synthetic data with differential privacy
Contracting data patterns enhance privacy-utility balance
Ensures robust privacy without significant utility loss
🔎 Similar Papers
No similar papers found.