Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning

📅 2024-02-12
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Addressing the trilemma of privacy preservation, model utility, and computational scalability in large language model (LLM) fine-tuning, this paper proposes Phase-wise Differentially Private Zeroth-Order Stochastic Optimization (DP-ZOSO). Unlike conventional DP-SGD, DP-ZOSO eliminates reliance on exact gradients by employing zeroth-order stochastic gradient estimation from forward passes alone, thereby circumventing the computational bottleneck of backpropagated gradients. It further introduces a dynamic scheduling mechanism that jointly adapts the learning rate and privacy noise magnitude to balance zeroth-order estimation error against privacy-induced perturbation during optimization. The authors provide theoretical convergence guarantees under differential privacy constraints. Empirical evaluations demonstrate significant improvements: at ε = 4, RoBERTa-Large achieves +4.5% and +5.5% accuracy gains on SST-5 and MNLI, respectively, while OPT-2.7B attains +9.2% and +3.9% improvements on CB and BoolQ. Gains are particularly pronounced on complex reasoning tasks.
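The core mechanism summarized above, estimating a gradient from two loss evaluations along a random direction, then clipping and noising the scalar estimate for privacy, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact DP-ZOSO algorithm; the function name, hyperparameter defaults, and per-step noise placement are assumptions for exposition.

```python
import random

def dp_zo_step(loss_fn, theta, mu=1e-3, lr=1e-2, clip=1.0, sigma=1.0, rng=None):
    """One differentially private zeroth-order update (illustrative sketch).

    Estimates a directional derivative from two forward evaluations
    (no backpropagation), clips the scalar finite difference to bound
    sensitivity, and adds Gaussian noise before stepping along z.
    """
    rng = rng or random.Random(0)
    # Random perturbation direction z ~ N(0, I)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    # Two-point finite-difference estimate of the derivative along z
    up = loss_fn([t + mu * zi for t, zi in zip(theta, z)])
    dn = loss_fn([t - mu * zi for t, zi in zip(theta, z)])
    d = (up - dn) / (2 * mu)
    # Clip the scalar estimate so its sensitivity is bounded by `clip`
    d = max(-clip, min(clip, d))
    # Gaussian noise on the clipped scalar (hypothetical noise placement)
    d += sigma * clip * rng.gauss(0.0, 1.0)
    # Descend along the sampled direction
    return [t - lr * d * zi for t, zi in zip(theta, z)]

# Toy usage: a few private ZO steps on a simple quadratic
rng = random.Random(0)
loss = lambda w: sum(x * x for x in w)
theta = [1.0] * 4
for _ in range(5):
    theta = dp_zo_step(loss, theta, rng=rng)
```

Because only the scalar finite difference carries private information, noise is added to one number per step rather than to a full gradient vector, which is one intuition for why zeroth-order DP methods can scale well.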

📝 Abstract
Fine-tuning on task-specific datasets is a widely embraced paradigm for harnessing the powerful capabilities of pretrained LLMs on downstream tasks. Given the popularity of LLM fine-tuning and its accompanying privacy concerns, differentially private (DP) fine-tuning of pretrained LLMs has been widely used to safeguard the privacy of task-specific datasets. At the design core of DP LLM fine-tuning methods is a satisfactory trade-off among privacy, utility, and scalability. Most existing methods build upon the seminal work of DP-SGD. Despite pushing the scalability of DP-SGD to its limit, DP-SGD-based fine-tuning methods remain limited by the inherent inefficiency of SGD. In this paper, we investigate the potential of DP zeroth-order methods for LLM fine-tuning, which avoid the scalability bottleneck of SGD by approximating the gradient with the more efficient zeroth-order gradient estimate. Rather than treating the zeroth-order method as a drop-in replacement for SGD, this paper presents a comprehensive study, both theoretical and empirical. First, we propose the stagewise DP zeroth-order method (DP-ZOSO), which dynamically schedules key hyperparameters. This design is grounded in the synergy between DP random perturbation and the gradient approximation error of the zeroth-order method, and its effect on the fine-tuning trajectory. We provide theoretical analysis for both proposed methods. We conduct extensive empirical analysis on both an encoder-only masked language model and a decoder-only autoregressive language model, achieving impressive results in terms of scalability and utility regardless of the class of tasks: compared with DPZero, DP-ZOPO improves accuracy by 4.5% on SST-5 and 5.5% on MNLI with RoBERTa-Large, and by 9.2% on CB and 3.9% on BoolQ with OPT-2.7B at ε = 4, with more significant enhancement on more complicated tasks.
Problem

Research questions and friction points this paper is trying to address.

Privacy risks of fine-tuning LLMs on task-specific datasets, motivating DP fine-tuning.
Scalability bottleneck of DP-SGD-based fine-tuning, which relies on exact backpropagated gradients.
Achieving a satisfactory privacy–utility–scalability trade-off across diverse task types.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DP zeroth-order methods for LLM fine-tuning
Stagewise DP-ZOSO with dynamic hyperparameter scheduling
Theoretical and empirical analysis of DP-ZOPO performance
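The stagewise scheduling idea listed above can be sketched as a simple schedule generator. The geometric decay factor, stage count, and stage lengths here are illustrative assumptions, not the paper's actual schedule; DP-ZOSO's scheduling is derived from the interplay between ZO approximation error and DP noise.

```python
def staged_schedule(num_stages=3, steps_per_stage=100, lr0=1e-2, mu0=1e-3, decay=0.5):
    """Hypothetical stagewise hyperparameter schedule (illustrative only).

    Each stage shrinks the learning rate and the zeroth-order smoothing
    scale mu by a fixed factor, reflecting the idea of jointly adapting
    step size and perturbation scale as optimization progresses.
    """
    for s in range(num_stages):
        lr = lr0 * (decay ** s)   # stage-level learning rate
        mu = mu0 * (decay ** s)   # stage-level ZO smoothing scale
        for _ in range(steps_per_stage):
            yield lr, mu

# Usage: iterate the schedule inside a fine-tuning loop
for lr, mu in staged_schedule(num_stages=2, steps_per_stage=3):
    pass  # one DP-ZO update per (lr, mu) pair would go here
```

The design intuition is that early stages tolerate coarser, noisier gradient estimates, while later stages need smaller steps and finer perturbations as the iterate approaches a solution.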