Leveraging Coordinate Momentum in SignSGD and Muon: Memory-Optimized Zero-Order

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the memory and computational overhead of fine-tuning large language models (LLMs), this paper proposes two zeroth-order (ZO) optimization algorithms: JAGUAR SignSGD and JAGUAR Muon. The first is a momentum-based ZO extension of SignSGD, and the paper establishes the first rigorous convergence guarantee for SignSGD in the stochastic ZO setting. The second is the first ZO variant of the Muon optimizer, which exploits the matrix structure of model parameters, and comes with a convergence-rate analysis under arbitrary stochastic noise. Both methods combine coordinate-wise momentum with parameter-efficient fine-tuning (PEFT) techniques such as LoRA, require no more stored parameters than standard ZO SGD, and use only O(1) function evaluations per iteration. Empirically, they match or surpass the convergence quality of Adam/SGD on LLM fine-tuning benchmarks while substantially reducing memory consumption.
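The summary describes a sign-based update driven by coordinate-wise momentum with O(1) function evaluations per step. A minimal NumPy sketch of that idea follows; the function name, hyperparameters, and the two-point coordinate estimate are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def jaguar_signsgd_sketch(f, x, steps=2000, lr=0.01, beta=0.9, h=1e-3, seed=0):
    """Hedged sketch of a ZO sign update with coordinate-wise momentum.

    Per iteration: sample one coordinate, estimate its partial derivative
    with a two-point finite difference, refresh only that momentum
    coordinate, then step along sign(momentum). Only the momentum buffer
    is stored, matching the memory footprint of plain ZO SGD.
    """
    rng = np.random.default_rng(seed)
    m = np.zeros_like(x)                       # coordinate-wise momentum
    for _ in range(steps):
        i = rng.integers(x.size)               # random coordinate
        e = np.zeros_like(x)
        e[i] = 1.0
        g_i = (f(x + h * e) - f(x - h * e)) / (2 * h)  # finite-difference estimate
        m[i] = beta * m[i] + (1 - beta) * g_i  # update only coordinate i
        x = x - lr * np.sign(m)                # sign-based step
    return x

# toy quadratic with minimum at (1, -2)
f = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
x_final = jaguar_signsgd_sketch(f, np.array([5.0, 5.0]))
```

On this toy objective the iterate settles into a small neighborhood of the minimizer, since the constant-magnitude sign step oscillates at the learning-rate scale once it arrives.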

📝 Abstract
Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks. Yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zero-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques like LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requiring the same number of parameters as the standard ZO SGD and only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO case. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms meet or exceed the convergence quality of standard first-order methods, achieving significant memory reduction. Our theoretical and empirical results establish new ZO optimization methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM.
Problem

Research questions and friction points this paper is trying to address.

Memory-efficient fine-tuning for large language models
Reducing computational costs in zero-order optimization
Convergence guarantees for SignSGD in the stochastic zero-order setting
Innovation

Methods, ideas, or system contributions that make the work stand out.

ZO momentum-based algorithm for SignSGD
Matrix-structured ZO extension of Muon
Memory-efficient zero-order LLM fine-tuning
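The "matrix-structured ZO extension of Muon" can be illustrated with a short sketch: a random-perturbation (SPSA-style) gradient estimate feeds a momentum matrix, whose orthogonalized form drives the update. Muon itself orthogonalizes via Newton–Schulz iterations; an SVD is used here for brevity, and all names and hyperparameters are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def zo_muon_sketch(f, W, steps=500, lr=0.1, beta=0.9, h=1e-4, seed=0):
    """Hedged sketch of a zeroth-order Muon-style update on a matrix parameter.

    Per iteration: estimate the gradient with a single random perturbation,
    accumulate it into a momentum matrix, then step along the orthogonalized
    momentum (SVD here stands in for Muon's Newton-Schulz iterations).
    """
    rng = np.random.default_rng(seed)
    M = np.zeros_like(W)                                  # momentum matrix
    for _ in range(steps):
        Z = rng.standard_normal(W.shape)                  # random direction
        g = (f(W + h * Z) - f(W - h * Z)) / (2 * h) * Z   # SPSA-style estimate
        M = beta * M + (1 - beta) * g
        U, _, Vt = np.linalg.svd(M, full_matrices=False)  # orthogonalize momentum
        W = W - lr * (U @ Vt)                             # step with unit singular values
    return W

# toy objective: drive W toward a fixed target matrix T
T = np.arange(6.0).reshape(2, 3)
f = lambda W: float(np.sum((W - T) ** 2))
W_final = zo_muon_sketch(f, np.zeros((2, 3)))
```

Because the update direction has all singular values equal to one, the step exploits the matrix shape of the parameter rather than treating it as a flat vector, which is the structural idea the innovation list highlights.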