To Shuffle or not to Shuffle: Auditing DP-SGD with Shuffling

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
This work shows that replacing Poisson sub-sampling with data shuffling in DP-SGD causes the reported theoretical differential privacy guarantees to overstate the actual protection. To measure this gap, the authors introduce a novel empirical privacy auditing procedure designed for DP-SGD with shuffling, combining DP auditing with membership inference attacks to estimate leakage directly. Experiments show that state-of-the-art models trained with shuffling overestimate their privacy guarantees by up to 4×, with the size of the gap depending on batch size, the prescribed privacy budget ε, and the threat model. The study also evaluates two shuffling variants found in real implementations, which leak even more. Overall, it provides a quantitative characterization of the privacy gap induced by shuffling and a concrete warning for DP-SGD deployments that report Poisson-style guarantees.

📝 Abstract
Differentially Private Stochastic Gradient Descent (DP-SGD) is a popular method for training machine learning models with formal Differential Privacy (DP) guarantees. As DP-SGD processes the training data in batches, it uses Poisson sub-sampling to select batches at each step. However, due to computational and compatibility benefits, replacing sub-sampling with shuffling has become common practice. Yet, since tight theoretical guarantees for shuffling are currently unknown, prior work using shuffling reports DP guarantees as though Poisson sub-sampling was used. This prompts the need to verify whether this discrepancy is reflected in a gap between the theoretical guarantees from state-of-the-art models and the actual privacy leakage. To do so, we introduce a novel DP auditing procedure to analyze DP-SGD with shuffling. We show that state-of-the-art DP models trained with shuffling appreciably overestimated privacy guarantees (up to 4x). In the process, we assess the impact of several parameters, such as batch size, privacy budget, and threat model, on privacy leakage. Finally, we study two variations of the shuffling procedure found in the wild, which result in further privacy leakage. Overall, our work empirically attests to the risk of using shuffling instead of Poisson sub-sampling vis-à-vis the actual privacy leakage of DP-SGD.
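The distinction the abstract hinges on is how batches are formed: Poisson sub-sampling includes each example independently with some probability q at every step (so batch sizes vary and the privacy analysis can amplify by sampling), whereas shuffling draws one random permutation per epoch and cuts it into fixed-size batches. A minimal sketch of the two batching schemes (illustrative only, not the paper's code):

```python
import random


def poisson_batches(data, q, steps, seed=0):
    """Poisson sub-sampling: each example joins a batch independently
    with probability q at every step, so batch sizes vary."""
    rng = random.Random(seed)
    return [[x for x in data if rng.random() < q] for _ in range(steps)]


def shuffled_batches(data, batch_size, seed=0):
    """Shuffling: one random permutation per epoch, cut into
    fixed-size batches -- the common drop-in replacement."""
    rng = random.Random(seed)
    perm = list(data)
    rng.shuffle(perm)
    return [perm[i:i + batch_size] for i in range(0, len(perm), batch_size)]


data = list(range(1000))
pb = poisson_batches(data, q=0.01, steps=5)   # 5 variable-size batches
sb = shuffled_batches(data, batch_size=10)    # 100 batches of exactly 10
```

Under shuffling, every example appears exactly once per epoch; under Poisson sub-sampling, an example may appear in several batches or in none, which is precisely the structure the standard DP-SGD accounting assumes.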
Problem

Research questions and friction points this paper is trying to address.

Auditing DP-SGD with shuffling for privacy leakage gaps
Comparing theoretical vs actual privacy guarantees in DP-SGD
Assessing impact of shuffling variations on privacy leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel DP auditing procedure tailored to DP-SGD trained with shuffling instead of Poisson sub-sampling
Empirical measurement of how batch size, privacy budget, and threat model affect leakage
Analysis of two in-the-wild shuffling variants that leak additional privacy
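Auditing procedures of this kind typically convert a membership inference attack's observed error rates into an empirical lower bound on ε: any attack distinguishing neighboring datasets under (ε, δ)-DP must satisfy FPR + e^ε · FNR ≥ 1 − δ (and symmetrically), so good attack performance certifies that the true ε cannot be small. A sketch of this generic estimator (a standard DP auditing bound, not necessarily the paper's exact procedure; the function name is illustrative):

```python
import math


def empirical_eps_lower_bound(fpr, fnr, delta=0.0):
    """Generic (eps, delta)-DP auditing bound: an attack with false
    positive rate `fpr` and false negative rate `fnr` implies
    eps >= max(log((1-delta-fpr)/fnr), log((1-delta-fnr)/fpr)).
    Returns 0.0 when the rates certify nothing."""
    if fpr == 0 and fnr == 0:
        # A perfect attack is inconsistent with any finite epsilon.
        return math.inf
    candidates = [0.0]
    if fnr > 0 and 1 - delta - fpr > 0:
        candidates.append(math.log((1 - delta - fpr) / fnr))
    if fpr > 0 and 1 - delta - fnr > 0:
        candidates.append(math.log((1 - delta - fnr) / fpr))
    return max(candidates)
```

Comparing such an empirical lower bound against the ε reported under the Poisson-sub-sampling analysis is what exposes the gap: if the audited bound exceeds the claimed budget, the reported guarantee was an overestimate.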