AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of adaptive step sizes for orthogonalized updates in the Muon optimizer. We propose AdaGO, the first algorithm to integrate an AdaGrad-style adaptive mechanism, based on the accumulated gradient norm, into an orthogonalized update framework, performing spectral descent while strictly preserving the orthogonality of the update directions. The method introduces only a single additional scalar state variable, keeping the implementation lightweight while retaining theoretical guarantees. Under standard smoothness and bounded-variance noise assumptions, we prove that AdaGO attains the optimal $O(1/\sqrt{T})$ convergence rate for nonconvex optimization. Experiments on CIFAR-10 classification and function regression tasks show that AdaGO outperforms both Muon and Adam, empirically validating the benefit of combining orthogonality with adaptivity.

📝 Abstract
The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam.
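To make the update concrete, here is a minimal sketch of an AdaGO-style step, assuming the details described in the abstract: a Muon-style orthogonalized momentum direction (approximated here with the Newton-Schulz iteration commonly used in Muon implementations) scaled by an AdaGrad-type stepsize built from the accumulated squared gradient norms, a single scalar state variable. This is not the authors' code; the function names, coefficient choices, and hyperparameter values are illustrative, and square weight matrices are assumed for simplicity.

```python
import numpy as np

def newton_schulz_orthogonalize(M, steps=5):
    """Approximately orthogonalize M (Muon-style Newton-Schulz iteration)."""
    X = M / (np.linalg.norm(M) + 1e-12)  # normalize so the iteration is stable
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic coefficients used in Muon code
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def adago_step(W, grad, momentum, v, lr=0.02, beta=0.95, eps=1e-8):
    """One AdaGO-style update (sketch): orthogonal direction, norm-adaptive stepsize."""
    momentum = beta * momentum + grad          # Muon-style momentum buffer
    v = v + np.sum(grad ** 2)                  # accumulated squared gradient norms (scalar)
    O = newton_schulz_orthogonalize(momentum)  # orthogonalized update direction
    W = W - lr / (np.sqrt(v) + eps) * O        # AdaGrad-type scaling of the step
    return W, momentum, v
```

Note how little state this adds on top of Muon: the momentum buffer already exists, and adaptivity costs only the scalar `v`, in contrast to Adam-style methods that keep a second per-parameter moment matrix.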
Problem

Research questions and friction points this paper is trying to address.

Adaptive learning rates for orthogonalized momentum updates
Combining AdaGrad stepsizes with orthogonal update directions
Preserving orthogonality while adapting to optimization landscape
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines AdaGrad stepsize with orthogonal updates
Preserves orthogonality while adapting to optimization landscape
Requires only minimal modification to Muon, with a single additional scalar variable
Minxin Zhang
Department of Mathematics, University of California, Los Angeles
Yuxuan Liu
Department of Mathematics, University of California, Los Angeles
Hayden Schaeffer
Professor of Mathematics, UCLA