🤖 AI Summary
Zeroth-order (ZO) fine-tuning of large language models (LLMs) is memory-efficient but suffers from slow convergence and lower accuracy than first-order (FO) training. To address these limitations, we propose DiZO, a divergence-driven zeroth-order optimization method. A layer-wise divergence analysis reveals that FO and ZO optimization follow distinct update patterns across layers; DiZO exploits this by applying learnable projections to ZO updates, calibrating each layer's update magnitude to its individual optimization needs. Extensive evaluation on RoBERTa-large, OPT-series, and Llama-series models demonstrates significantly accelerated convergence, cutting GPU training time by up to 48%, while outperforming representative ZO baselines across multiple benchmarks; in some scenarios, DiZO even exceeds memory-intensive FO fine-tuning in task performance.
📝 Abstract
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization has emerged as a promising memory-efficient training paradigm: it avoids backward passes and relies solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge this gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Building on these findings, and aiming to match the learning capacity of FO methods, we propose **Di**vergence-driven **Z**eroth-**O**rder (**DiZO**) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating updates of diverse magnitudes precisely scaled to each layer's individual optimization needs. Our results demonstrate that DiZO significantly reduces the iterations needed for convergence without sacrificing throughput, cutting GPU training hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series models on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning.
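To make the forward-pass-only paradigm concrete, here is a minimal sketch of an SPSA-style two-point ZO gradient estimate combined with a per-layer update scale. The toy quadratic `loss`, the `gamma` schedule, and all function names are illustrative assumptions standing in for an LLM forward pass and DiZO's learned projections, not the paper's implementation:

```python
import numpy as np

def loss(params):
    # Toy quadratic loss standing in for an LLM forward pass;
    # the minimum is at every parameter equal to 1.0.
    return sum(float(np.sum((p - 1.0) ** 2)) for p in params.values())

def zo_gradient_estimate(params, eps=1e-3, seed=0):
    """SPSA-style two-point ZO estimate using only forward passes:
    g_hat = (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z,
    with z drawn from a standard Gaussian per parameter tensor."""
    rng = np.random.default_rng(seed)
    zs = {k: rng.standard_normal(p.shape) for k, p in params.items()}
    plus = {k: p + eps * zs[k] for k, p in params.items()}
    minus = {k: p - eps * zs[k] for k, p in params.items()}
    scale = (loss(plus) - loss(minus)) / (2 * eps)
    return {k: scale * z for k, z in zs.items()}

def layerwise_zo_step(params, lr=0.05, seed=0):
    """One ZO update with a per-layer magnitude factor gamma_k, a
    hypothetical stand-in for divergence-driven projection scaling."""
    grads = zo_gradient_estimate(params, seed=seed)
    gamma = {k: 1.0 / (1.0 + i) for i, k in enumerate(params)}  # placeholder schedule
    return {k: p - lr * gamma[k] * grads[k] for k, p in params.items()}

params = {"layer0": np.zeros(4), "layer1": np.zeros(4)}
loss_start = loss(params)
for t in range(200):
    params = layerwise_zo_step(params, seed=t)  # fresh perturbation each step
```

No gradients are ever backpropagated: each step costs two forward evaluations, which is what makes ZO attractive when activation memory for the backward pass is the bottleneck.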