🤖 AI Summary
Differentially private machine learning (DPML) suffers from inconsistent evaluation protocols, heterogeneous implementations, and poor reproducibility, undermining the credibility of state-of-the-art (SoTA) claims. To address this, we conduct a systematic benchmarking study of 11 cutting-edge DPML methods, assessing their reproducibility and transferability via controlled experiments that standardize datasets, model architectures, and foundational techniques (e.g., DP-SGD) across diverse execution environments. Our results are mixed: some methods hold up under scrutiny, while others degrade substantially outside their original configurations, pointing to a reproducibility problem in DPML. We identify critical factors, including DP noise sensitivity and hyperparameter coupling, and propose targeted mitigation strategies. Furthermore, we establish the first open DPML reproducibility benchmark and release a best-practice guideline to foster more scientific, comparable, and reliable DPML research.
📝 Abstract
There is a flurry of recent research papers proposing novel differentially private machine learning (DPML) techniques. These papers claim to achieve new state-of-the-art (SoTA) results and offer empirical results as validation. However, there is no consensus on which techniques are most effective or whether they genuinely meet their stated claims. Complicating matters, heterogeneity in codebases, datasets, methodologies, and model architectures makes direct comparisons of different approaches challenging.
In this paper, we conduct a reproducibility and replicability (R+R) experiment on 11 different SoTA DPML techniques from the recent research literature. The results of our investigation are mixed: some methods stand up to scrutiny, while others falter when tested outside their initial experimental conditions. We also discuss challenges unique to the reproducibility of DPML, including the additional randomness due to DP noise, and how to address them. Finally, we derive insights and best practices for obtaining scientifically valid and reliable results.