Efficient DP-SGD for LLMs with Randomized Clipping

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory and compute overhead of training large language models under differential privacy, which becomes prohibitive as model size and context length grow. The authors propose DP-SGD-RC, a variant of DP-SGD that uses Hutchinson’s stochastic trace estimator and its improved variant, Hutch++, to estimate per-sample gradient norms with a smaller memory footprint. Combined with a randomized clipping mechanism, the method reduces memory and compute complexity, and a tight privacy analysis shows that it achieves noise multipliers competitive with deterministic clipping. In experiments fine-tuning Llama 3.2-1B on long-context classification, question answering, and summarization tasks, DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
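
For reference, the standard identity behind Hutchinson's estimator can be written for a generic per-sample gradient matrix $G$. The symbols $G$, $z$, $z_k$, and $m$ are notation assumed here for illustration and are not taken from the paper. For any random probe vector $z$ with $\mathbb{E}[z z^\top] = I$ (for example, i.i.d. Rademacher entries):

$$
\|G\|_F^2 \;=\; \operatorname{tr}(G^\top G) \;=\; \mathbb{E}_z\!\left[z^\top G^\top G\, z\right] \;=\; \mathbb{E}_z\!\left[\|G z\|_2^2\right],
\qquad
\widehat{\|G\|_F^2} \;=\; \frac{1}{m} \sum_{k=1}^{m} \|G z_k\|_2^2 .
$$

The estimate is unbiased for any number of probes $m \ge 1$, and it needs only matrix-vector products with $G$, not $G$ itself.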

📝 Abstract

Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection. However, state-of-the-art DP training implementations rely on fast gradient clipping techniques with memory overhead $O(B \min\{T^2, d^2\})$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the model width. This becomes prohibitive as both model size and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with randomized clipping that reduces memory and compute complexity. DP-SGD-RC leverages stochastic trace estimation methods, specifically Hutchinson's estimator [Hutchinson, 1989] and its improved variant, Hutch++ [Meyer et al., 2021], to reduce the memory footprint of per-sample gradient norm estimation. We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping. Experiments fine-tuning Llama 3.2-1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute requirements.
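
To make the memory argument concrete, here is a minimal sketch of how a Hutchinson-style probe might estimate one sample's squared gradient norm for a single linear layer. It is an illustration under stated assumptions, not the paper's implementation. It assumes the per-sample weight gradient has the form $G = A^\top E$, where $A$ ($T \times d_{in}$) holds the layer inputs and $E$ ($T \times d_{out}$) holds the gradients with respect to the layer outputs. The function names and the pure-Python list representation are hypothetical. The randomized clipping step, Hutch++, and the privacy analysis are not shown.

```python
import random


def exact_sq_norm(A, E):
    """Reference value: form G = A^T E explicitly (d_in x d_out memory)."""
    T, d_in, d_out = len(A), len(A[0]), len(E[0])
    total = 0.0
    for i in range(d_in):
        for j in range(d_out):
            g_ij = sum(A[t][i] * E[t][j] for t in range(T))
            total += g_ij * g_ij
    return total


def hutchinson_sq_norm(A, E, num_probes, rng):
    """Estimate ||A^T E||_F^2 using only matrix-vector products.

    Each probe computes G z as A^T (E z), so the largest temporaries are
    vectors of length T and d_in. Neither the d_in x d_out gradient nor a
    T x T Gram matrix is ever materialized.
    """
    T, d_in, d_out = len(A), len(A[0]), len(E[0])
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: i.i.d. entries in {-1, +1}, so E[z z^T] = I.
        z = [rng.choice((-1.0, 1.0)) for _ in range(d_out)]
        # u = E z, a vector of length T.
        u = [sum(e * zj for e, zj in zip(row, z)) for row in E]
        # w = A^T u = G z, a vector of length d_in.
        w = [sum(A[t][i] * u[t] for t in range(T)) for i in range(d_in)]
        total += sum(x * x for x in w)
    return total / num_probes


if __name__ == "__main__":
    rng = random.Random(0)
    T, d_in, d_out = 6, 4, 5  # toy sizes, chosen only for illustration
    A = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(T)]
    E = [[rng.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(T)]
    print("exact   :", exact_sq_norm(A, E))
    print("estimate:", hutchinson_sq_norm(A, E, 4000, rng))
```

With few probes the estimate is noisy. That noise makes any clipping based on it randomized, which is presumably why the abstract pairs the estimator with a dedicated privacy analysis.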

Problem

Research questions and friction points this paper is trying to address.

Differential Privacy

Large Language Models

Gradient Clipping

Memory Overhead

DP-SGD

Innovation

Methods, ideas, or system contributions that make the work stand out.

DP-SGD

randomized clipping

Hutch++