RMNP: Row-Momentum Normalized Preconditioning for Scalable Matrix-Based Optimization

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of preconditioning in deep neural network training by proposing the RMNP optimizer. RMNP is a lightweight preconditioning strategy that combines row-wise ℓ₂ normalization with momentum. It is motivated by the empirically observed diagonal block structure of the layerwise Hessian in Transformers, and it avoids the expensive Newton–Schulz iterations used by Muon. This reduces the per-iteration preconditioning complexity from O(mn·min(m,n)) to O(mn) for an m×n weight matrix, substantially reducing preconditioning wall-clock time in large language model pretraining while achieving optimization performance comparable to that of Muon. The paper also establishes convergence guarantees for RMNP in the non-convex setting.
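The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `row_normalize` and `rmnp_like_step`, the heavy-ball form of the momentum buffer, and the values of `lr`, `beta`, and `eps` are all assumptions made for illustration. The only point it shows is that scaling each row of the momentum matrix to unit ℓ₂ norm touches every entry a constant number of times, so the cost is O(mn).

```python
import numpy as np

def row_normalize(M, eps=1e-8):
    """Scale each row of M to (approximately) unit l2 norm.

    eps is a hypothetical guard against division by zero for all-zero rows.
    """
    norms = np.linalg.norm(M, axis=1, keepdims=True)  # shape (m, 1)
    return M / (norms + eps)

def rmnp_like_step(W, G, buf, lr=0.02, beta=0.95):
    """One illustrative update step for an m x n weight matrix W.

    The momentum form and hyperparameter defaults are assumptions,
    not values taken from the paper.
    """
    buf = beta * buf + G          # momentum accumulation (assumed form)
    update = row_normalize(buf)   # O(mn) step used in place of Newton-Schulz
    return W - lr * update, buf

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))   # toy weight matrix
G = rng.standard_normal((4, 6))   # toy gradient
buf = np.zeros_like(W)
W_new, buf = rmnp_like_step(W, G, buf)
```

With this form, every row of the weight matrix moves by the same distance `lr` per step regardless of the gradient's scale, which is the row-wise analogue of the normalized update that Muon obtains through orthogonalization.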

📝 Abstract
Preconditioned adaptive methods have gained significant attention for training deep neural networks, as they capture rich curvature information of the loss landscape. The central challenge in this field lies in balancing preconditioning effectiveness with the computational efficiency of implementing the preconditioner. Among recent advances, Muon stands out by using Newton-Schulz iteration to obtain preconditioned updates without explicitly constructing the preconditioning matrix. Despite its advantages, the efficiency of Muon still leaves room for further improvement. In this paper, we introduce RMNP (Row Momentum Normalized Preconditioning), an optimizer that replaces Newton-Schulz iteration with a simple row-wise $\ell_2$ normalization operation, motivated by the empirically observed diagonal block structure of the Transformer layerwise Hessian. This substitution reduces the per-iteration computational complexity from $\mathcal{O}(mn\cdot\min(m,n))$ to $\mathcal{O}(mn)$ for an $m\times n$ weight matrix while maintaining comparable optimization performance. Theoretically, we establish convergence guarantees for RMNP in the non-convex setting that match recent results for Muon optimizers, achieving the information-theoretic minimax optimal complexity. Extensive experiments on large language model pretraining show that RMNP delivers competitive optimization performance compared with Muon while substantially reducing preconditioning wall-clock time. Our code is available at https://anonymous.4open.science/r/RMNP-E8E1/.
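To give a sense of scale for the two complexity bounds stated in the abstract, consider a hypothetical square weight matrix with $m = n = 4096$ (a size chosen for illustration, not taken from the paper):

$$
mn = 4096^2 = 16{,}777{,}216 \approx 1.7\times 10^{7},
\qquad
mn\cdot\min(m,n) = 4096^3 = 68{,}719{,}476{,}736 \approx 6.9\times 10^{10}.
$$

The ratio of the two counts is $\min(m,n) = 4096$. These are asymptotic operation counts that ignore constant factors and hardware effects, so they should not be read as a predicted wall-clock speedup.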
Problem

Research questions and friction points this paper is trying to address.

preconditioning
computational efficiency
matrix optimization
deep neural networks
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

preconditioning
row-wise normalization
computational efficiency
non-convex optimization
large language models