🤖 AI Summary
In differentially private (DP) training, adding spherical noise to gradients severely degrades the performance of adaptive optimizers such as AdaGrad and Adam; existing mitigation strategies are mostly validated on simple tasks, raising concerns about generalizability. This paper provides a systematic theoretical analysis and identifies a key flaw: the common goal of unbiased estimation of the gradient's second moment is misguided under DP noise. The authors advocate a "scale-then-privatize" paradigm (first scaling gradients via adaptive statistics, then applying DP noise) which deliberately sacrifices unbiasedness in exchange for better noise robustness and more desirable convergence behavior. The method is naturally compatible with standard DP gradient perturbation mechanisms (e.g., DP-SGD) and with correlated-noise mechanisms. Empirically, on a small-scale language-model training task, it outperforms the other adaptive DP optimizer variants studied in utility, privacy-utility trade-off, and training stability. The approach bridges theory and practice, offering both principled justification and empirical validation.
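To make the ordering concrete, here is a minimal sketch (not the paper's actual algorithm) contrasting the conventional privatize-then-scale step with scale-then-privatize. The function names, the source of the second-moment statistic `v` (e.g., from earlier rounds or public data), and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def clip_and_noise(g, clip_norm, sigma, rng):
    """Standard DP gradient perturbation: clip to clip_norm, add Gaussian noise."""
    norm = np.linalg.norm(g)
    g = g * min(1.0, clip_norm / (norm + 1e-12))
    return g + rng.normal(0.0, sigma * clip_norm, size=g.shape)

def privatize_then_scale(g, v, clip_norm, sigma, eps, rng):
    """Conventional adaptive DP step: privatize first, then precondition
    with the second-moment estimate v (computed from noisy gradients)."""
    g_priv = clip_and_noise(g, clip_norm, sigma, rng)
    return g_priv / (np.sqrt(v) + eps)

def scale_then_privatize(g, v, clip_norm, sigma, eps, rng):
    """Scale-then-privatize: precondition with adaptive statistics v first
    (a biased but noise-robust choice), then clip and add DP noise."""
    g_scaled = g / (np.sqrt(v) + eps)
    return clip_and_noise(g_scaled, clip_norm, sigma, rng)
```

Note that with the noise multiplier set to zero and a large clipping norm the two orderings coincide; they diverge precisely because clipping and noise interact differently with the preconditioner, which is where the paper's analysis lives.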
📝 Abstract
The spherical noise added to gradients in differentially private (DP) training undermines the performance of adaptive optimizers like AdaGrad and Adam, and many recent works have therefore proposed algorithms to address this challenge. However, the empirical results in these works focus on simple tasks and models, and their conclusions may not generalize to model training in practice. In this paper we survey several of these variants, develop better theoretical intuition for them, and perform empirical studies comparing them. We find that the common intuition of aiming for unbiased estimates of the second moments of gradients in adaptive optimizers is misguided; instead, a simple technique called scale-then-privatize (which does not achieve unbiased second moments) has more desirable theoretical behavior and outperforms all other variants we study on a small-scale language-model training task. We additionally argue that scale-then-privatize makes the noise addition better match the application of correlated-noise mechanisms, which are more desirable to use in practice.