🤖 AI Summary
This work elucidates the design principle of the Muon optimizer from the perspective of implicit Newton methods, revealing that Muon corresponds to an approximate Newton update that neglects right preconditioning by the input second-moment matrix. Building on this insight, we construct a quadratic surrogate model that relies only on the gradient, an output-space curvature matrix, and the input data matrix. By leveraging matrix perturbation analysis, the singular value decomposition, and an isotropic weight assumption, we derive a closed-form update rule and propose Newton-Muon, a novel optimizer that explicitly incorporates input second-moment information. Empirically, on GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer optimization steps than the original Muon, translating to an approximately 4% decrease in wall-clock training time.
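The quadratic surrogate itself is not written out in this summary. One Gauss-Newton-style form consistent with the description above (our reconstruction, an assumption rather than the paper's exact model) would be:

```latex
% Hedged reconstruction of the surrogate, with G, H, Z playing the roles
% described above (gradient, output-space curvature, stacked layer inputs):
L(W + \Delta) \;\approx\; L(W) \;+\; \langle G, \Delta \rangle
  \;+\; \tfrac{1}{2}\,\operatorname{tr}\!\bigl( H \,\Delta\, Z Z^{\top} \Delta^{\top} \bigr)
```

Setting the gradient of this surrogate with respect to $\Delta$ to zero yields $\Delta^\star = -H^{-1} G (ZZ^\top)^{-1}$: left preconditioning by the output-space curvature $H$ and right preconditioning by the input second moment $ZZ^\top$. Under this reading, replacing the $H^{-1}$ scaling with orthogonalization gives an msgn-style update, while additionally dropping $(ZZ^\top)^{-1}$ recovers standard Muon's neglect of the input second moment.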
📝 Abstract
The Muon optimizer has received considerable attention for its strong performance in training large language models, yet the design principle behind its matrix-gradient orthogonalization remains largely elusive. In this paper, we introduce a surrogate model that not only sheds new light on the design of Muon, but more importantly leads to a new optimizer. In the same spirit as the derivation of Newton's method, the surrogate approximates the loss as a quadratic function of the perturbation to a weight matrix $W$ using only three matrices: the gradient $G$, an output-space curvature matrix $H$, and the data matrix $Z$ that stacks the layer inputs. By minimizing this surrogate in one step and adopting a certain isotropic assumption on the weights, we obtain the closed-form update rule (up to momentum and weight decay) $W \leftarrow W - \eta \cdot \mathrm{msgn}(G(ZZ^\top)^{-1})$, where $\eta$ is the learning rate and $\mathrm{msgn}(X) = UV^\top$ if $X = USV^\top$ is a compact singular value decomposition. This new optimization method, which we refer to as Newton-Muon, shows that standard Muon can be interpreted as an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, on a reproduction of the earliest publicly released Modded-NanoGPT speedrun configuration using Muon for GPT-2 pretraining, Newton-Muon reaches the target validation loss in 6% fewer iteration steps and reduces wall-clock training time by about 4%.
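The closed-form update can be sketched numerically. Below is a minimal NumPy sketch of one Newton-Muon step as stated in the abstract, without momentum or weight decay; the `eps` ridge term that regularizes $ZZ^\top$ for invertibility is our addition, not part of the stated rule:

```python
import numpy as np

def msgn(X):
    """Matrix sign via compact SVD: msgn(X) = U V^T for X = U S V^T."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def newton_muon_step(W, G, Z, lr, eps=1e-8):
    """One Newton-Muon step: W <- W - lr * msgn(G (Z Z^T)^{-1}).

    W, G : (d_out, d_in) weight matrix and its gradient.
    Z    : (d_in, n) matrix stacking the n layer inputs as columns.
    eps  : ridge regularizer on Z Z^T (our assumption, for invertibility).
    """
    ZZt = Z @ Z.T + eps * np.eye(Z.shape[0])
    # Z Z^T is symmetric, so solve(ZZt, G.T).T computes G @ inv(ZZt).
    preconditioned = np.linalg.solve(ZZt, G.T).T
    return W - lr * msgn(preconditioned)
```

Because the step direction is `msgn` of a matrix, all of its singular values equal one: like Muon, the update is an orthogonalized direction, but here of the right-preconditioned gradient $G(ZZ^\top)^{-1}$ rather than of $G$ itself.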