RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of preconditioning in deep neural network training by proposing the RMNP optimizer. RMNP is a lightweight preconditioning strategy that combines momentum with row-wise ℓ₂ normalization. Motivated by the empirically observed diagonal block structure of the layerwise Hessian in Transformers, it replaces the expensive Newton–Schulz iteration used by Muon. This reduces the per-iteration preconditioning cost from O(mn·min(m,n)) to O(mn) for an m×n weight matrix, substantially cutting preconditioning wall-clock time in large language model pretraining while achieving optimization performance comparable to that of Muon. RMNP also comes with convergence guarantees in the non-convex setting.
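The idea summarized above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `rmnp_step`, the exponential-moving-average momentum convention, and the default values of `lr`, `beta`, and `eps` are all assumptions made for the example.

```python
import numpy as np

def rmnp_step(weight, grad, momentum, lr=0.02, beta=0.95, eps=1e-8):
    """One hypothetical RMNP-style update for a single m x n weight matrix.

    The momentum convention and hyperparameter defaults are assumptions,
    not values taken from the paper.
    """
    # Exponential moving average of gradients (assumed momentum form).
    momentum = beta * momentum + (1.0 - beta) * grad
    # Row-wise l2 norms: a single pass over all m*n entries, i.e. O(mn) work.
    row_norms = np.linalg.norm(momentum, axis=1, keepdims=True)
    # Scale every row to (approximately) unit l2 norm; eps guards zero rows.
    update = momentum / (row_norms + eps)
    return weight - lr * update, momentum
```

Each row of the applied update has unit ℓ₂ norm regardless of the gradient scale, and no matrix multiplication is needed, which is where the O(mn) cost comes from.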

📝 Abstract

Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing the effectiveness of the preconditioner against the computational cost of implementing it. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://anonymous.4open.science/r/RMNP-E8E1/.
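To see where the $\mathcal{O}(mn\cdot\min(m,n))$ term in the abstract comes from, the sketch below implements the classical cubic Newton-Schulz iteration for approximate orthogonalization. It is an illustration under stated assumptions: Muon uses its own tuned polynomial coefficients, which are not reproduced here, and the function name `newton_schulz_orthogonalize` and the default `steps` value are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(matrix, steps=5, eps=1e-8):
    """Approximate the orthogonal polar factor of an m x n matrix.

    Uses the classical cubic iteration X <- 1.5 X - 0.5 (X X^T) X,
    chosen for illustration; it is not Muon's tuned variant.
    """
    # Frobenius scaling puts all singular values in (0, 1], where the
    # cubic iteration is known to converge towards 1.
    x = matrix / (np.linalg.norm(matrix) + eps)
    # Work on the wide orientation so the Gram matrix is min(m,n) x min(m,n).
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        gram = x @ x.T                # matmul: O(mn * min(m, n))
        x = 1.5 * x - 0.5 * gram @ x  # matmul: O(mn * min(m, n))
    return x.T if transposed else x
```

Every step needs two matrix products whose cost scales as $mn\cdot\min(m,n)$, whereas row-wise $\ell_2$ normalization touches each of the $mn$ entries a constant number of times and needs no matrix product at all.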

Problem

Research questions and friction points this paper is trying to address.

preconditioning

computational efficiency

matrix optimization

deep neural networks

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

preconditioning

row-wise normalization

computational efficiency