🤖 AI Summary
This work addresses the high memory overhead and unclear convergence benefits of adaptive methods in zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models. The authors find that high-dimensional ZO gradients exhibit little coordinate-wise heterogeneity, which makes conventional per-coordinate adaptive strategies memory-inefficient. To overcome this, they propose MEAZO, an adaptive ZO optimizer that performs global step size adaptation using only a single scalar. MEAZO thereby combines the optimization performance of ZO-Adam with the memory efficiency of ZO-SGD. Experiments across multiple large language model families and tasks show that MEAZO matches the performance of ZO-Adam while keeping memory consumption close to that of ZO-SGD, and that it is more robust to the choice of step size.
📝 Abstract
We investigate the effectiveness of adaptive zeroth-order (ZO) optimization for memory-constrained fine-tuning of large language models (LLMs). Contrary to prior claims, we show that adaptive ZO methods such as ZO-Adam offer no convergence advantage over well-tuned ZO-SGD, while incurring significant memory overhead. Our analysis reveals that in high dimensions, ZO gradients lack coordinate-wise heterogeneity, rendering adaptive mechanisms memory-inefficient. Leveraging this insight, we propose MEAZO, a memory-efficient adaptive ZO optimizer that tracks only a single scalar for global step size adaptation. We support our method with theoretical convergence guarantees under standard assumptions. Experiments across multiple LLM families and tasks demonstrate that MEAZO matches ZO-Adam's performance with the memory footprint of ZO-SGD. Additional experiments on synthetic quadratic problems and LLM fine-tuning further demonstrate MEAZO's enhanced robustness to step size choices, particularly in grouped or block-structured optimization settings.
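The abstract does not state MEAZO's update rule, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of "a single scalar for global step size adaptation", under assumptions of our own: a two-point Gaussian-perturbation gradient estimate, and an Adam-style exponential moving average of the mean squared gradient entry kept as one scalar. The function name `zo_scalar_adaptive` and all hyperparameters (`lr`, `mu`, `beta`, `eps`) are hypothetical, as is the toy quadratic objective.

```python
import numpy as np

def zo_scalar_adaptive(f, x0, steps=500, lr=0.02, mu=1e-3, beta=0.99, eps=1e-8, seed=0):
    """Illustrative ZO optimizer whose only adaptive state is one scalar.

    Assumption: this is a generic sketch, not the MEAZO update from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    v = 0.0  # single scalar: running average of the mean squared gradient entry
    for t in range(1, steps + 1):
        u = rng.standard_normal(x.shape)                 # random probe direction
        # Two-point finite-difference estimate of the directional derivative.
        c = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        g = c * u                                        # ZO gradient estimate
        v = beta * v + (1.0 - beta) * float(np.mean(g * g))
        v_hat = v / (1.0 - beta ** t)                    # Adam-style bias correction
        x -= lr * g / (np.sqrt(v_hat) + eps)             # same scaling for every coordinate
    return x, v

# Toy usage on a synthetic quadratic (hypothetical test problem).
a = np.linspace(1.0, 5.0, 20)
quadratic = lambda x: 0.5 * float(np.sum(a * x * x))
x_start = np.ones(20)
x_final, v_final = zo_scalar_adaptive(quadratic, x_start)
```

By contrast, a ZO-Adam-style method would keep first- and second-moment vectors the same size as `x`, which is the memory overhead the abstract refers to; here the extra optimizer state is the single float `v`.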