🤖 AI Summary
This work addresses the performance degradation of optimizers in differentially private (DP) deep learning. We study DP-AdamW, a DP adaptation of AdamW with decoupled weight decay, and introduce its bias-corrected variant, DP-AdamW-BC. Methodologically, DP-AdamW combines the standard DP training mechanisms (gradient clipping and Gaussian noise injection) with AdamW's adaptive learning rates and decoupled weight decay, so that both the first- and second-moment estimates are computed from privatized gradients. We provide theoretical guarantees on privacy and convergence. Experiments across privacy budgets $\epsilon = 1, 3, 7$ show that DP-AdamW outperforms DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15% higher on text classification, up to 5% higher on image classification, and consistently 1% higher on graph node classification. In contrast, the bias correction in DP-AdamW-BC consistently decreases accuracy, indicating that the benefit DP-AdamBC shows over DP-Adam does not carry over to AdamW.
📝 Abstract
As deep learning methods increasingly rely on sensitive data, differential privacy (DP) offers formal guarantees against information leakage during model training. A significant challenge remains in implementing DP optimizers that retain strong performance while preserving privacy. Recent advances have introduced ever more efficient optimizers, with AdamW being a popular choice for training deep learning models because of its strong empirical performance. We study \emph{DP-AdamW} and introduce \emph{DP-AdamW-BC}, a differentially private variant of the AdamW optimizer with DP bias correction for the second moment estimator. We start by showing theoretical results for privacy and convergence guarantees of DP-AdamW and DP-AdamW-BC. Then, we empirically analyze the behavior of both optimizers across multiple privacy budgets ($\epsilon = 1, 3, 7$). We find that DP-AdamW outperforms existing state-of-the-art differentially private optimizers like DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15% higher on text classification, up to 5% higher on image classification, and consistently 1% higher on graph node classification. Moreover, we empirically show that incorporating bias correction in DP-AdamW (DP-AdamW-BC) consistently decreases accuracy, in contrast to the improvement of DP-AdamBC over DP-Adam.
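To make the ingredients named above concrete, the following is a minimal sketch of one DP-AdamW-style update, not the authors' implementation. It assumes the usual DP-SGD recipe (clip each per-sample gradient, sum, add Gaussian noise, divide by the batch size) followed by a textbook AdamW step with decoupled weight decay. The function names, the `state` layout, and all default hyperparameters are hypothetical. Privacy accounting is omitted, as is the second-moment bias correction of DP-AdamW-BC, whose exact form the abstract does not give.

```python
import math
import random


def init_state(n):
    """Optimizer state for n parameters (hypothetical layout)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def clip_per_sample(grads, clip_norm):
    """Scale each per-sample gradient so its L2 norm is at most clip_norm."""
    clipped = []
    for g in grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        clipped.append([x * scale for x in g])
    return clipped


def dp_adamw_step(params, per_sample_grads, state, *, lr=1e-3,
                  betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2,
                  clip_norm=1.0, noise_multiplier=1.0, rng=random):
    """One illustrative DP-AdamW update on a flat list of parameters."""
    batch_size = len(per_sample_grads)
    clipped = clip_per_sample(per_sample_grads, clip_norm)

    # Privatized mean gradient: sum of clipped gradients plus Gaussian noise
    # with standard deviation noise_multiplier * clip_norm, over batch size.
    sigma = noise_multiplier * clip_norm
    noisy = [
        (sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)) / batch_size
        for i in range(len(params))
    ]

    state["t"] += 1
    t = state["t"]
    b1, b2 = betas
    new_params = []
    for i, (p, g) in enumerate(zip(params, noisy)):
        # Both moment estimates only ever see the privatized gradient.
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * g
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * g * g
        m_hat = state["m"][i] / (1 - b1 ** t)  # standard Adam debiasing
        v_hat = state["v"][i] / (1 - b2 ** t)
        # Decoupled weight decay: shrink the weight directly instead of
        # adding an L2 term to the gradient.
        p = p - lr * weight_decay * p
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_params.append(p)
    return new_params
```

Because the weight decay acts on the parameters rather than on the gradient, it is not clipped and does not pass through the noisy moment estimates in this sketch. Setting `noise_multiplier=0.0` with a large `clip_norm` reduces the step to plain AdamW, which is a convenient sanity check.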