SMOTE and Mirrors: Exposing Privacy Leakage from Synthetic Minority Oversampling

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper exposes a critical privacy risk of SMOTE in privacy-sensitive settings: its minority-class oversampling inadvertently reveals original sensitive records, and this leakage goes undetected by conventional evaluation methods. To demonstrate the vulnerability, the authors propose two novel adversarial attacks: DistinSMOTE, which exploits geometric disparities between real and synthetic samples to tell them apart, and ReconSMOTE, which reconstructs original minority-class records with high fidelity. Combining membership inference, distance-based analysis, and geometric modeling, and backed by theoretical proofs, both attacks are evaluated across eight diverse imbalanced datasets; under typical class-imbalance ratios they achieve near-perfect precision and recall (≈100%). This work provides the first systematic evidence that SMOTE offers no inherent privacy protection, delivering a crucial cautionary insight for practitioners and motivating the co-design of class-balancing techniques with privacy-preserving machine learning.
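The leakage stems from how SMOTE generates data: each synthetic point is a linear interpolation between a real minority record and one of its k nearest minority neighbors, so every synthetic sample lies exactly on a segment between two real points. A minimal sketch of this generation step (an illustrative reimplementation, not the paper's code; function and parameter names are assumptions):

```python
import numpy as np

def smote_samples(X_min, k=5, n_new=10, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation:
    each new point lies on the segment between a real minority record
    and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbors
    out = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                # pick a real minority record
        m = nn[i, rng.integers(k)]         # and one of its neighbors
        lam = rng.random()                 # interpolation weight in [0, 1]
        out[j] = X_min[i] + lam * (X_min[m] - X_min[i])
    return out
```

Because every output is a convex combination of exactly two real records, the synthetic data inherits a rigid geometric structure, which is what the attacks below exploit.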

📝 Abstract
The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most widely used methods for addressing class imbalance and generating synthetic data. Despite its popularity, little attention has been paid to its privacy implications; yet, it is used in the wild in many privacy-sensitive applications. In this work, we conduct the first systematic study of privacy leakage in SMOTE: We begin by showing that prevailing evaluation practices, i.e., naive distinguishing and distance-to-closest-record metrics, completely fail to detect any leakage and that membership inference attacks (MIAs) can be instantiated with high accuracy. Then, by exploiting SMOTE's geometric properties, we build two novel attacks with very limited assumptions: DistinSMOTE, which perfectly distinguishes real from synthetic records in augmented datasets, and ReconSMOTE, which reconstructs real minority records from synthetic datasets with perfect precision and recall approaching one under realistic imbalance ratios. We also provide theoretical guarantees for both attacks. Experiments on eight standard imbalanced datasets confirm the practicality and effectiveness of these attacks. Overall, our work reveals that SMOTE is inherently non-private and disproportionately exposes minority records, highlighting the need to reconsider its use in privacy-sensitive applications.
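To see why SMOTE's geometry is detectable, consider that a synthetic record is an exact point on a segment between two other records in the released data, whereas a real record generically is not. A brute-force sketch of this collinearity test (an illustration of the geometric idea only; the paper's DistinSMOTE is a more refined construction with formal guarantees):

```python
import numpy as np

def flag_interpolated(X, tol=1e-8):
    """Flag records that lie exactly on the segment between two other
    records in X -- the geometric signature SMOTE interpolation leaves.
    O(n^3) brute force for clarity; real data with collinear triples
    could produce false positives."""
    n = len(X)
    flags = np.zeros(n, dtype=bool)
    for s in range(n):
        for i in range(n):
            if i == s or flags[s]:
                continue
            for m in range(i + 1, n):
                if m == s:
                    continue
                v = X[m] - X[i]
                denom = v @ v
                if denom < tol:
                    continue
                lam = (X[s] - X[i]) @ v / denom   # projection onto segment
                if 0.0 <= lam <= 1.0 and \
                        np.linalg.norm(X[s] - (X[i] + lam * v)) < tol:
                    flags[s] = True
                    break
    return flags
```

Points flagged this way are candidate synthetic records; inverting the interpolation then recovers the real endpoints, which is the intuition behind reconstruction.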
Problem

Research questions and friction points this paper is trying to address.

Exposing privacy leakage in SMOTE data synthesis
Developing attacks to distinguish and reconstruct records
Revealing disproportionate minority record exposure risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Membership inference attacks reveal SMOTE privacy leakage
DistinSMOTE perfectly distinguishes real from synthetic records
ReconSMOTE reconstructs real minority records from synthetic data