🤖 AI Summary
In differentially private (DP) training, per-sample gradient clipping in methods like DP-Adam incurs prohibitively high GPU memory overhead, severely limiting scalability to large models. To address this, we propose DP-GRAPE—a novel algorithm that replaces costly singular value decomposition (SVD)-based subspace estimation with randomized Gaussian projection, enabling on-the-fly low-dimensional gradient projection and privatization during backpropagation. This design eliminates the SVD computational bottleneck while preserving rigorous $(\varepsilon,\delta)$-differential privacy and achieving privacy–utility trade-offs comparable to DP-SGD. Experiments demonstrate that DP-GRAPE reduces GPU memory consumption by 63% during ViT pretraining and by 70% during RoBERTa-Large fine-tuning. Notably, it enables, for the first time, DP fine-tuning of the OPT-6.7B model—matching DP-Adam in both accuracy and training speed while drastically lowering memory requirements.
📝 Abstract
Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. Rather than directly applying DP to GaLore, DP-GRAPE introduces three key modifications: (1) gradients are privatized after projection, (2) random Gaussian matrices replace SVD-based subspaces, and (3) projection is applied during backpropagation. These modifications eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy–utility trade-off comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and by over 70% when fine-tuning RoBERTa-Large, compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters.
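To make the three modifications above concrete, here is a minimal NumPy sketch of one DP-GRAPE-style step for a single weight matrix's gradients. This is an illustrative reconstruction, not the authors' implementation: the function name `dp_grape_step` and all parameter names (`rank`, `clip_norm`, `noise_mult`) are our own, and the clipping/noising follows the standard Gaussian-mechanism recipe. The key point it shows is the order of operations: per-sample gradients are first projected with a random Gaussian matrix (no SVD), and clipping and noise are applied *after* projection, in the low-dimensional subspace.

```python
import numpy as np

def dp_grape_step(per_sample_grads, rank, clip_norm, noise_mult, rng):
    """Illustrative sketch of a DP-GRAPE-style privatized gradient step.

    per_sample_grads: (n, d) array, one flattened gradient per example.
    Returns the noisy averaged gradient in the rank-dimensional subspace,
    plus the projection matrix (needed to map updates back to d dims).
    """
    n, d = per_sample_grads.shape
    # (2) A random Gaussian matrix replaces SVD-based subspace estimation;
    # scaling by 1/sqrt(rank) roughly preserves norms (Johnson-Lindenstrauss).
    P = rng.standard_normal((d, rank)) / np.sqrt(rank)
    # (3) Project each per-sample gradient into the low-dimensional subspace.
    projected = per_sample_grads @ P                      # shape (n, rank)
    # (1) Privatize AFTER projection: per-sample clipping in rank dims...
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    clipped = projected * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # ...then add Gaussian noise calibrated to the clipping sensitivity.
    noise = rng.standard_normal(rank) * noise_mult * clip_norm
    return (clipped.sum(axis=0) + noise) / n, P
```

Because clipping and noising happen in `rank` dimensions rather than `d`, the per-sample state that must be held in memory shrinks accordingly, which is the source of the memory savings described above.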