Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Standard first-order optimization methods often converge to sharp minima with poor generalization, while zeroth-order approaches suffer from high variance and slow convergence in the high-dimensional output spaces of large language models. To address these limitations, this work proposes Hi-ZFO, a novel framework that reframes zeroth-order optimization as a form of beneficial stochasticity deliberately introduced to enhance exploration. The method leverages layer-wise importance analysis to partition model parameters: critical layers are updated via first-order optimization to ensure efficient convergence, whereas less sensitive layers employ zeroth-order updates to promote broader exploration. This hybrid strategy consistently achieves superior performance across generative, mathematical reasoning, and code-related tasks while significantly reducing training time.

📝 Abstract
Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose Hi-ZFO (Hierarchical Zeroth- and First-Order optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of "beneficial stochasticity" to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.
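
To make the hybrid idea concrete, here is a minimal NumPy sketch of one way an importance-guided FO/ZO split could look on a toy two-layer linear model. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the importance metric, the ZO estimator, or the learning rates, so the gradient-norm importance score, the two-point Gaussian (SPSA-style) estimator, and every function name and hyperparameter below are hypothetical choices made for this example.

```python
import numpy as np

def loss(params, x, y):
    """Mean squared error of a toy two-layer linear model (stand-in for an LLM loss)."""
    pred = (x @ params["w1"]) @ params["w2"]
    return float(np.mean((pred - y) ** 2))

def fo_grads(params, x, y):
    """Analytic (first-order) gradients of the loss for both tensors."""
    hidden = x @ params["w1"]
    dpred = 2.0 * (hidden @ params["w2"] - y) / y.size
    return {"w1": x.T @ (dpred @ params["w2"].T), "w2": hidden.T @ dpred}

def zo_grad(params, name, x, y, rng, eps=1e-3):
    """Two-point zeroth-order estimate for one tensor (assumed estimator).

    Uses only two loss evaluations along a random Gaussian direction z.
    """
    z = rng.standard_normal(params[name].shape)
    plus, minus = dict(params), dict(params)
    plus[name] = params[name] + eps * z
    minus[name] = params[name] - eps * z
    directional = (loss(plus, x, y) - loss(minus, x, y)) / (2.0 * eps)
    return directional * z

def partition_by_importance(params, x, y, num_fo):
    """Rank tensors by gradient norm (an assumed importance proxy).

    The top `num_fo` tensors get FO updates; the rest get ZO updates.
    """
    grads = fo_grads(params, x, y)
    ranked = sorted(grads, key=lambda n: np.linalg.norm(grads[n]), reverse=True)
    return set(ranked[:num_fo]), set(ranked[num_fo:])

def hybrid_step(params, x, y, fo_names, zo_names, rng, lr_fo=0.05, lr_zo=0.01):
    """One hybrid update: exact gradients on FO tensors, estimates on ZO tensors."""
    grads = fo_grads(params, x, y)
    updated = dict(params)
    for name in fo_names:
        updated[name] = params[name] - lr_fo * grads[name]
    for name in zo_names:
        updated[name] = params[name] - lr_zo * zo_grad(params, name, x, y, rng)
    return updated

def demo(steps=200, seed=0):
    """Train the toy model with the hybrid rule; return (initial, final) loss."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, 4))
    y = x @ rng.standard_normal((4, 3)) @ rng.standard_normal((3, 2))
    params = {
        "w1": 0.5 * rng.standard_normal((4, 3)),
        "w2": 0.5 * rng.standard_normal((3, 2)),
    }
    fo_names, zo_names = partition_by_importance(params, x, y, num_fo=1)
    initial = loss(params, x, y)
    for _ in range(steps):
        params = hybrid_step(params, x, y, fo_names, zo_names, rng)
    return initial, loss(params, x, y)

if __name__ == "__main__":
    initial, final = demo()
    print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

In this sketch the partition is computed once before training; a real system would need to decide how often to re-profile, and a memory-conscious one would avoid computing full gradients for tensors assigned to the ZO group. The random direction `z` in `zo_grad` is where the stochasticity described in the abstract enters the update.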
Problem

Research questions and friction points this paper is trying to address.

large language models
fine-tuning
zeroth-order optimization
first-order optimization
estimation variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order optimization
First-order optimization
Hierarchical fine-tuning
Importance-guided tensor selection
Beneficial stochasticity