🤖 AI Summary
Existing differentially private (DP) training relies predominantly on DP-SGD, which incurs high computational overhead, requires intricate hyperparameter tuning, and offers no native support for sparse gradients or memory-efficient optimizer state, even though adaptive optimizers are standard in non-private settings. This work proposes DP-MicroAdam, the first DP adaptive optimization algorithm that simultaneously achieves memory efficiency, sparsity awareness, and theoretically optimal convergence. It provides the first rigorous proof under DP constraints that adaptive methods attain the $\mathcal{O}(1/\sqrt{T})$ convergence rate for non-convex objectives. DP-MicroAdam integrates gradient sparsification, low-rank memory compression, and a privacy-adaptive learning-rate coordination mechanism. Experiments on CIFAR-10, ImageNet, and Transformer fine-tuning show that DP-MicroAdam significantly outperforms existing DP adaptive methods in accuracy while matching or surpassing DP-SGD.
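The summary names the ingredients (clipping-based privatization, sparsity, adaptive moments) but not the exact update rule. The sketch below is a minimal, hedged illustration of how a sparsity-aware DP adaptive step could be assembled: per-sample gradient clipping, Gaussian noise, top-k sparsification, and Adam-style moment updates. The function name, hyperparameters, and the specific top-k step are assumptions for illustration only, not the paper's DP-MicroAdam algorithm.

```python
import numpy as np

def dp_adaptive_step(params, per_sample_grads, m, v, lr=1e-3, clip_norm=1.0,
                     noise_mult=1.0, topk_frac=0.1, betas=(0.9, 0.999), eps=1e-8):
    """Illustrative DP adaptive update (hypothetical, not the paper's algorithm):
    clip per-sample gradients, privatize their sum with Gaussian noise,
    sparsify, then apply Adam-style moment updates."""
    n = per_sample_grads.shape[0]
    # Per-sample clipping bounds each example's contribution (sensitivity = clip_norm).
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    clipped = per_sample_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Gaussian mechanism: noise standard deviation proportional to the sensitivity.
    noisy_grad = (clipped.sum(axis=0)
                  + np.random.normal(0.0, noise_mult * clip_norm, size=params.shape)) / n
    # Keep only the largest-magnitude coordinates (assumed sparsification step).
    k = max(1, int(topk_frac * params.size))
    mask = np.zeros_like(noisy_grad)
    mask[np.argsort(np.abs(noisy_grad))[-k:]] = 1.0
    sparse_grad = noisy_grad * mask
    # Adam-style first/second moments computed on the sparsified, privatized gradient.
    b1, b2 = betas
    m = b1 * m + (1 - b1) * sparse_grad
    v = b2 * v + (1 - b2) * sparse_grad ** 2
    params = params - lr * m / (np.sqrt(v) + eps)
    return params, m, v

# Purely illustrative usage with random data.
rng = np.random.default_rng(0)
params = rng.normal(size=100)
m, v = np.zeros(100), np.zeros(100)
per_sample_grads = rng.normal(size=(32, 100))  # 32 per-sample gradients
params, m, v = dp_adaptive_step(params, per_sample_grads, m, v)
```

Bias correction and the paper's memory-compression and learning-rate coordination mechanisms are omitted here; the sketch only shows where privatization and sparsification would sit relative to the adaptive moments.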
📝 Abstract
Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal $\mathcal{O}(1/\sqrt{T})$ rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.
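For context, non-convex convergence guarantees of this kind are usually stated as a bound on the average expected squared gradient norm over $T$ steps. A generic form consistent with the abstract's claim, where $C(\varepsilon,\delta,d)$ is a placeholder for the privacy-dependent constants rather than the paper's exact theorem, is:

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla f(x_t)\|^2\big] \;\le\; \frac{C(\varepsilon,\delta,d)}{\sqrt{T}}.$$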