Zeroth-Order Optimization at the Edge of Stability

📅 2026-04-16

🤖 AI Summary

This work addresses the poorly understood stability of zeroth-order (ZO) optimization methods in deep learning, in particular the lack of a theoretical characterization of how the stable step size relates to the Hessian spectrum. Using mean-square linear stability analysis, the authors show for the first time that the stability condition of ZO methods depends on the full Hessian spectrum, rather than solely on its largest eigenvalue as is typical for first-order methods. They derive a computable stability boundary that requires only the largest eigenvalue and the trace of the Hessian, and show that large step sizes implicitly regularize the Hessian trace in ZO optimization. The findings apply to ZO-GD, ZO-GDM, and ZO-Adam and are validated empirically across diverse deep learning tasks, where these methods operate near the predicted stability edge.
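
To make the dependence on the full spectrum concrete, here is a minimal sketch. It is not the paper's derivation or its stated condition, neither of which is reproduced in this summary. It assumes a quadratic loss f(x) = 0.5 x^T H x with strictly positive eigenvalues, Gaussian directions u ~ N(0, I), and a ZO-GD update x <- x - eta (u^T H x) u. Under those assumptions, a direct moment calculation gives a linear recursion s <- M s for the second moments s_i = E[x_i^2] in the eigenbasis of H, with M = diag(1 - 2 eta lam_i + 2 eta^2 lam_i^2) + eta^2 1 (lam^2)^T, and mean-square stability corresponds to the spectral radius of M being below 1. All function names are hypothetical.

```python
import numpy as np

def second_moment_matrix(eigs, eta):
    """Matrix M with s_next = M @ s, where s_i = E[x_i^2].

    Assumption (not from the paper): quadratic loss, Gaussian directions,
    update x <- x - eta * (u^T H x) * u, eigs are the eigenvalues of H.
    """
    lam = np.asarray(eigs, dtype=float)
    diag = 1.0 - 2.0 * eta * lam + 2.0 * eta**2 * lam**2
    # Rank-one term: every coordinate picks up eta^2 * sum_j lam_j^2 * s_j.
    return np.diag(diag) + eta**2 * np.outer(np.ones_like(lam), lam**2)

def is_mean_square_stable(eigs, eta):
    """True if the second-moment recursion contracts (spectral radius < 1)."""
    M = second_moment_matrix(eigs, eta)
    return np.max(np.abs(np.linalg.eigvals(M))) < 1.0

def critical_step_size(eigs, iters=60):
    """Bisect for the largest stable eta under the assumptions above.

    The search interval (0, 1/lam_max) is used because the diagonal entry
    for lam_max already reaches 1 at eta = 1/lam_max in this toy model.
    """
    lo, hi = 0.0, 1.0 / float(np.max(eigs))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if is_mean_square_stable(eigs, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Two hypothetical spectra with the same largest eigenvalue, different trace.
flat_tail = [10.0, 0.1, 0.1, 0.1]
heavy_tail = [10.0, 4.0, 4.0, 4.0]
print("critical eta, small trace:", critical_step_size(flat_tail))
print("critical eta, large trace:", critical_step_size(heavy_tail))
```

Both spectra share the same largest eigenvalue, so a criterion based on that eigenvalue alone would treat them identically. In this toy model the spectrum with the larger trace tolerates a smaller step size, which is in line with the summary's point that the trace, and not only the top eigenvalue, enters the ZO stability boundary.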

📝 Abstract

Zeroth-order (ZO) methods are widely used when gradients are unavailable or prohibitively expensive, including black-box learning and memory-efficient fine-tuning of large models, yet their optimization dynamics in deep learning remain underexplored. In this work, we provide an explicit step size condition that exactly captures the (mean-square) linear stability of a family of ZO methods based on the standard two-point estimator. Our characterization reveals a sharp contrast with first-order (FO) methods: whereas FO stability is governed solely by the largest Hessian eigenvalue, mean-square stability of ZO methods depends on the entire Hessian spectrum. Since computing the full Hessian spectrum is infeasible in practical neural network training, we further derive tractable stability bounds that depend only on the largest eigenvalue and the Hessian trace. Empirically, we find that full-batch ZO methods operate at the edge of stability: ZO-GD, ZO-GDM, and ZO-Adam consistently stabilize near the predicted stability boundary across a range of deep learning training problems. Our results highlight an implicit regularization effect specific to ZO methods, where large step sizes primarily regularize the Hessian trace, whereas in FO methods they regularize the top eigenvalue.
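
For readers unfamiliar with the two-point estimator mentioned above, a common form is g = (f(x + mu u) - f(x - mu u)) / (2 mu) * u for a random direction u. The abstract does not say which variant the paper uses, so the Gaussian directions, the central difference, and the smoothing parameter `mu` below are assumptions, and the function names are hypothetical. The sketch shows the estimator and a plain ZO-GD loop built on it.

```python
import numpy as np

def two_point_estimate(f, x, mu, rng):
    """Two-point ZO gradient estimate using only function evaluations.

    Assumption: Gaussian direction u ~ N(0, I) and a central difference;
    the paper's exact variant is not given in this summary.
    """
    u = rng.standard_normal(x.shape)
    directional = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
    return directional * u

def zo_gd(f, x0, eta, mu, steps, seed=0):
    """Plain ZO gradient descent: x <- x - eta * g, with g estimated above."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * two_point_estimate(f, x, mu, rng)
    return x

# Hypothetical quadratic test problem with Hessian H = diag(1.0, 0.5).
H = np.diag([1.0, 0.5])

def quadratic(x):
    return 0.5 * x @ H @ x

x_final = zo_gd(quadratic, x0=[1.0, 1.0], eta=0.1, mu=1e-3, steps=2000)
print("final loss:", quadratic(x_final))
```

Each step needs two function evaluations and no backpropagation, which is why such methods suit black-box settings and memory-efficient fine-tuning. On a quadratic, the central difference equals u^T H x up to rounding, so the estimate is (u^T H x) u and its average over directions is the true gradient H x.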

Problem

Research questions and friction points this paper is trying to address.

Zeroth-order optimization

Stability analysis

Hessian spectrum

Deep learning

Implicit regularization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order optimization

Edge of stability

Hessian spectrum