🤖 AI Summary
Zeroth-order (ZO) fine-tuning of large language models (LLMs) is memory-efficient but suffers from slow convergence and lower accuracy than first-order (FO) training. To address these limitations, we propose DiZO, a divergence-driven zeroth-order optimization method. A layer-wise divergence analysis reveals that FO and ZO optimization follow distinct update patterns across layers; DiZO exploits this by applying learnable projections to ZO updates, calibrating each layer's update magnitude to its individual optimization needs. Extensive evaluation on RoBERTa-large, OPT-series, and Llama-series models demonstrates significantly accelerated convergence, cutting GPU training time by up to 48%, while outperforming representative ZO baselines across multiple benchmarks; in some scenarios, DiZO even exceeds memory-intensive FO fine-tuning in task performance.
📝 Abstract
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization has emerged as a promising memory-efficient training paradigm: it avoids backward passes and relies solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge this gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Building on these findings, and aiming to match the learning capacity of FO methods, we propose **Di**vergence-driven **Z**eroth-**O**rder (**DiZO**) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections into ZO updates, generating updates of diverse magnitudes precisely scaled to each layer's individual optimization needs. Our results demonstrate that DiZO significantly reduces the iterations needed for convergence without sacrificing throughput, cutting GPU training hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series models on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning.
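To make the forward-pass-only paradigm concrete, here is a minimal sketch of an SPSA-style two-point ZO gradient estimate combined with a per-layer update scale. The toy quadratic `loss`, the `gamma` schedule, and all function names are illustrative assumptions standing in for an LLM forward pass and DiZO's learned projections, not the paper's implementation:

```python
import numpy as np

def loss(params):
    # Toy quadratic loss standing in for an LLM forward pass;
    # the minimum is at every parameter equal to 1.0.
    return sum(float(np.sum((p - 1.0) ** 2)) for p in params.values())

def zo_gradient_estimate(params, eps=1e-3, seed=0):
    """SPSA-style two-point ZO estimate using only forward passes:
    g_hat = (L(theta + eps*z) - L(theta - eps*z)) / (2*eps) * z,
    with z drawn from a standard Gaussian per parameter tensor."""
    rng = np.random.default_rng(seed)
    zs = {k: rng.standard_normal(p.shape) for k, p in params.items()}
    plus = {k: p + eps * zs[k] for k, p in params.items()}
    minus = {k: p - eps * zs[k] for k, p in params.items()}
    scale = (loss(plus) - loss(minus)) / (2 * eps)
    return {k: scale * z for k, z in zs.items()}

def layerwise_zo_step(params, lr=0.05, seed=0):
    """One ZO update with a per-layer magnitude factor gamma_k, a
    hypothetical stand-in for divergence-driven projection scaling."""
    grads = zo_gradient_estimate(params, seed=seed)
    gamma = {k: 1.0 / (1.0 + i) for i, k in enumerate(params)}  # placeholder schedule
    return {k: p - lr * gamma[k] * grads[k] for k, p in params.items()}

params = {"layer0": np.zeros(4), "layer1": np.zeros(4)}
loss_start = loss(params)
for t in range(200):
    params = layerwise_zo_step(params, seed=t)  # fresh perturbation each step
```

No gradients are ever backpropagated: each step costs two forward evaluations, which is what makes ZO attractive when activation memory for the backward pass is the bottleneck.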