Differentially Private Zeroth-Order Methods for Scalable Large Language Model Finetuning

📅 2024-02-12
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Addressing the trilemma of privacy preservation, model utility, and computational scalability in large language model (LLM) fine-tuning, this paper proposes Phase-wise Differentially Private Zeroth-Order Stochastic Optimization (DP-ZOSO). Unlike conventional DP-SGD, DP-ZOSO eliminates reliance on exact gradients by employing zeroth-order stochastic gradient estimation from forward passes alone, thereby circumventing the computational bottleneck of backpropagated gradients. It further introduces a dynamic scheduling mechanism that jointly adapts the learning rate and privacy noise magnitude to balance zeroth-order estimation error against privacy-induced perturbation during optimization. The authors provide theoretical convergence guarantees under differential privacy constraints. Empirical evaluations demonstrate significant improvements: at ε = 4, RoBERTa-Large achieves +4.5% and +5.5% accuracy gains on SST-5 and MNLI, respectively, while OPT-2.7B attains +9.2% and +3.9% improvements on CB and BoolQ. Gains are particularly pronounced on complex reasoning tasks.
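The core mechanism summarized above, estimating a gradient from two loss evaluations along a random direction, then clipping and noising the scalar estimate for privacy, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact DP-ZOSO algorithm; the function name, hyperparameter defaults, and per-step noise placement are assumptions for exposition.

```python
import random

def dp_zo_step(loss_fn, theta, mu=1e-3, lr=1e-2, clip=1.0, sigma=1.0, rng=None):
    """One differentially private zeroth-order update (illustrative sketch).

    Estimates a directional derivative from two forward evaluations
    (no backpropagation), clips the scalar finite difference to bound
    sensitivity, and adds Gaussian noise before stepping along z.
    """
    rng = rng or random.Random(0)
    # Random perturbation direction z ~ N(0, I)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    # Two-point finite-difference estimate of the derivative along z
    up = loss_fn([t + mu * zi for t, zi in zip(theta, z)])
    dn = loss_fn([t - mu * zi for t, zi in zip(theta, z)])
    d = (up - dn) / (2 * mu)
    # Clip the scalar estimate so its sensitivity is bounded by `clip`
    d = max(-clip, min(clip, d))
    # Gaussian noise on the clipped scalar (hypothetical noise placement)
    d += sigma * clip * rng.gauss(0.0, 1.0)
    # Descend along the sampled direction
    return [t - lr * d * zi for t, zi in zip(theta, z)]

# Toy usage: a few private ZO steps on a simple quadratic
rng = random.Random(0)
loss = lambda w: sum(x * x for x in w)
theta = [1.0] * 4
for _ in range(5):
    theta = dp_zo_step(loss, theta, rng=rng)
```

Because only the scalar finite difference carries private information, noise is added to one number per step rather than to a full gradient vector, which is one intuition for why zeroth-order DP methods can scale well.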

📝 Abstract
Fine-tuning on task-specific datasets is a widely embraced paradigm for harnessing the powerful capabilities of pretrained LLMs on downstream tasks. Given the popularity of LLM fine-tuning and its accompanying privacy concerns, differentially private (DP) fine-tuning of pretrained LLMs has been widely used to safeguard the privacy of task-specific datasets. At the design core of DP LLM fine-tuning methods is a satisfactory trade-off among privacy, utility, and scalability. Most existing methods build upon the seminal work of DP-SGD. Despite pushing the scalability of DP-SGD to its limit, DP-SGD-based fine-tuning methods remain limited by the inherent inefficiency of SGD. In this paper, we investigate the potential of DP zeroth-order methods for LLM fine-tuning, which avoid the scalability bottleneck of SGD by approximating the gradient with the more efficient zeroth-order gradient estimate. Rather than treating the zeroth-order method as a drop-in replacement for SGD, this paper presents a comprehensive study, both theoretical and empirical. First, we propose the stagewise DP zeroth-order method (DP-ZOSO), which dynamically schedules key hyperparameters. This design is grounded in the synergy between DP random perturbation and the gradient approximation error of the zeroth-order method, and its effect on the fine-tuning trajectory. We provide theoretical analysis for both proposed methods. We conduct extensive empirical analysis on both an encoder-only masked language model and a decoder-only autoregressive language model, achieving impressive results in terms of scalability and utility regardless of the class of tasks: compared with DPZero, DP-ZOPO improves accuracy by 4.5% on SST-5 and 5.5% on MNLI with RoBERTa-Large, and by 9.2% on CB and 3.9% on BoolQ with OPT-2.7B at ε = 4, with more significant enhancement on more complicated tasks.
Problem

Research questions and friction points this paper is trying to address.

Privacy risks of fine-tuning LLMs on task-specific datasets, motivating DP fine-tuning.
Scalability bottleneck of DP-SGD-based fine-tuning, which relies on exact backpropagated gradients.
Achieving a satisfactory privacy–utility–scalability trade-off across diverse task types.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DP zeroth-order methods for LLM fine-tuning
Stagewise DP-ZOSO with dynamic hyperparameter scheduling
Theoretical and empirical analysis of DP-ZOPO performance
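The stagewise scheduling idea listed above can be sketched as a simple schedule generator. The geometric decay factor, stage count, and stage lengths here are illustrative assumptions, not the paper's actual schedule; DP-ZOSO's scheduling is derived from the interplay between ZO approximation error and DP noise.

```python
def staged_schedule(num_stages=3, steps_per_stage=100, lr0=1e-2, mu0=1e-3, decay=0.5):
    """Hypothetical stagewise hyperparameter schedule (illustrative only).

    Each stage shrinks the learning rate and the zeroth-order smoothing
    scale mu by a fixed factor, reflecting the idea of jointly adapting
    step size and perturbation scale as optimization progresses.
    """
    for s in range(num_stages):
        lr = lr0 * (decay ** s)   # stage-level learning rate
        mu = mu0 * (decay ** s)   # stage-level ZO smoothing scale
        for _ in range(steps_per_stage):
            yield lr, mu

# Usage: iterate the schedule inside a fine-tuning loop
for lr, mu in staged_schedule(num_stages=2, steps_per_stage=3):
    pass  # one DP-ZO update per (lr, mu) pair would go here
```

The design intuition is that early stages tolerate coarser, noisier gradient estimates, while later stages need smaller steps and finer perturbations as the iterate approaches a solution.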