Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

📅 2024-02-23
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Zeroth-order optimizers for efficient fine-tuning of large language models (LLMs) suffer from slow convergence and suboptimal accuracy due to heterogeneous curvature across the high-dimensional parameter space. Method: This paper introduces diagonal Hessian information into the zeroth-order optimization framework—requiring only one additional forward pass—to enable second-order awareness with negligible memory overhead. The approach combines diagonal Hessian estimation with forward-pass-only gradient approximation, and is supported by a convergence analysis. Contribution/Results: Extensive experiments across model scales from 350M to 66B parameters demonstrate substantial reductions in training steps, accelerated convergence, and improved downstream task accuracy. Trajectory visualizations and theoretical proofs further validate the method's ability to adapt to curvature without compromising efficiency.

📝 Abstract
Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to backpropagation. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by heterogeneous parameter curvatures across dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer and the first work to leverage the diagonal Hessian to enhance zeroth-order optimization for fine-tuning LLMs. In addition, HiZOO avoids expensive memory costs, adding only one forward pass per step. Extensive experiments on various models (350M–66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
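The scheme the abstract describes — an SPSA-style two-forward-pass gradient estimate, plus one additional forward pass used to maintain a diagonal Hessian estimate that preconditions the update — can be sketched as below. This is an illustrative reconstruction, not the authors' implementation: the function name `hizoo_sketch_step`, the EMA smoothing factor `alpha`, the per-coordinate spread via `z**2`, and the `1e-8` damping constant are all assumptions made for the sketch.

```python
import numpy as np

def hizoo_sketch_step(theta, loss_fn, hess_diag, eps=1e-3, lr=1e-3, alpha=0.9):
    """One Hessian-informed zeroth-order step (illustrative sketch).

    theta     : parameter vector
    loss_fn   : callable mapping theta -> scalar loss (one "forward pass")
    hess_diag : running estimate of the absolute diagonal Hessian
    """
    z = np.random.randn(*theta.shape)        # shared Gaussian perturbation
    loss_plus = loss_fn(theta + eps * z)     # forward pass 1
    loss_minus = loss_fn(theta - eps * z)    # forward pass 2
    loss_center = loss_fn(theta)             # the one additional forward pass

    # SPSA-style central-difference gradient estimate along direction z
    grad_est = (loss_plus - loss_minus) / (2 * eps) * z

    # Second difference estimates curvature along z; spread it per coordinate
    # via z**2 and smooth with an EMA (both choices are assumptions here)
    curv = (loss_plus - 2 * loss_center + loss_minus) / eps**2
    hess_diag = alpha * hess_diag + (1 - alpha) * np.abs(curv) * z**2

    # Preconditioned update: damp moves along high-curvature coordinates
    theta = theta - lr * grad_est / (hess_diag + 1e-8)
    return theta, hess_diag

# Toy objective with heterogeneous curvature (100 vs 1), mimicking the
# paper's test-function setting
loss_fn = lambda t: 0.5 * (100.0 * t[0] ** 2 + t[1] ** 2)

np.random.seed(0)
theta = np.array([1.0, 1.0])
hess_diag = np.ones_like(theta)
loss_before = loss_fn(theta)
for _ in range(500):
    theta, hess_diag = hizoo_sketch_step(theta, loss_fn, hess_diag)
loss_after = loss_fn(theta)
```

On the toy quadratic, the curvature estimate grows much faster along the stiff coordinate, so the preconditioner automatically takes smaller steps there — the behavior the abstract attributes to handling heterogeneous curvatures.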
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Fine-tuning
First-order Optimizers
Zeroth-order Optimizer Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagonal Hessian Matrix
Zeroth-order Optimizer
Large Language Model Optimization