AI Summary
Existing optimizers struggle to simultaneously achieve efficiency, stability, and low computational overhead in large language model training. This work proposes Nora, a novel optimizer that explicitly stabilizes the evolution of weight norms and angles by projecting row-wise momentum onto the orthogonal complement of the weight matrix. Nora leverages the block-diagonal dominance structure of the Transformer Hessian to construct an efficient preconditioning approximation. It is the first optimizer to unify strict scale invariance, effective preconditioning, and linear computational complexity ($\mathcal{O}(mn)$), requiring only two lines of code for integration. Theoretical analysis establishes a scaling theorem for scalable optimizers, and experiments demonstrate Nora's superior stability and computational efficiency in large-scale training scenarios.
Abstract
Matrix-based optimizers have demonstrated immense potential in training Large Language Models (LLMs); however, designing an ideal optimizer remains a formidable challenge. A superior optimizer must satisfy three core desiderata: efficiency (achieving Muon-like preconditioning to accelerate optimization), stability (strictly adhering to the scale invariance inherent in neural networks), and speed (minimizing computational overhead). While existing methods address these aspects to varying degrees, they often fail to unify them, either incurring prohibitive computational costs, as Muon does, or allowing radial jitter that compromises stability, as RMNP does. To bridge this gap, we propose Nora, an optimizer that rigorously satisfies all three requirements. Nora achieves training stability by explicitly stabilizing weight norms and angular velocities through row-wise projection of the momentum onto the orthogonal complement of the weights. Simultaneously, by leveraging the block-diagonal dominance of the Transformer Hessian, Nora effectively approximates structured preconditioning while maintaining an optimal computational complexity of $\mathcal{O}(mn)$. Furthermore, we prove that Nora is a scalable optimizer and establish its corresponding scaling theorems. With a streamlined implementation requiring only two lines of code, our preliminary experiments validate Nora as an efficient and highly promising optimizer for large-scale training.
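The row-wise projection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give Nora's update rule, so the function name `project_rows_orthogonal`, the `eps` guard, and the pure-Python list representation are all assumptions made for illustration. The sketch only shows the projection step, in which each row of the momentum has its component parallel to the matching weight row removed. It touches every entry of the $m \times n$ matrices a constant number of times, which is consistent with the stated $\mathcal{O}(mn)$ cost.

```python
def project_rows_orthogonal(weights, momentum, eps=1e-12):
    """Remove, row by row, the momentum component parallel to the weight row.

    `weights` and `momentum` are lists of equal-length rows (lists of floats).
    Hypothetical helper for illustration only; `eps` is an assumed guard
    against division by zero for all-zero rows.
    """
    projected = []
    for w_row, m_row in zip(weights, momentum):
        sq_norm = sum(w * w for w in w_row)             # ||w_i||^2
        dot = sum(w * m for w, m in zip(w_row, m_row))  # <m_i, w_i>
        coeff = dot / (sq_norm + eps)
        # m_i minus its projection onto w_i
        projected.append([m - coeff * w for w, m in zip(w_row, m_row)])
    return projected


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 2.0]]
    M = [[3.0, 4.0], [5.0, 6.0]]
    P = project_rows_orthogonal(W, M)
    for w_row, p_row in zip(W, P):
        # Each inner product should be numerically close to zero.
        print(sum(w * p for w, p in zip(w_row, p_row)))
```

Because each projected row $p_i$ is orthogonal to $w_i$, a step $w_i - \eta p_i$ has squared norm $\|w_i\|^2 + \eta^2 \|p_i\|^2$, so the row norm changes only at second order in the step size. How Nora normalizes or preconditions the projected momentum beyond this step is not specified in the abstract and is left out here.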