🤖 AI Summary
The Muon optimizer improves the optimization geometry of LLM training via orthogonalization, but its updates exhibit highly imbalanced neuron-wise norms, destabilizing parameter updates. This work identifies and addresses that issue with NorMuon, an extension of Muon that adds neuron-level adaptive learning rates: it maintains neuron-wise second-moment estimates and applies row-wise normalization after orthogonalization to equalize update magnitudes. The authors also design a distributed orthogonalization scheme compatible with Fully Sharded Data Parallel v2 (FSDP2) to ensure scalability. In 1.1B-parameter pretraining experiments, NorMuon achieves 21.74% better training efficiency than Adam and 11.31% better than Muon, with a memory footprint comparable to Muon's, improving both optimization stability and scalability for large-model training.
📝 Abstract
The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon's emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon's conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B-parameter pretraining setting, while maintaining a memory footprint comparable to Muon's. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.
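To make the described update concrete, here is a minimal NumPy sketch of a NorMuon-style step as the abstract describes it: a momentum buffer is orthogonalized (here via the Newton–Schulz iteration that Muon uses), a per-row (neuron-wise) second-moment estimate is tracked, and each row is normalized by it. The hyperparameter values, the buffer-update form, and the final rescaling that preserves the overall update norm are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via the quintic Newton-Schulz
    iteration popularized by Muon (coefficients from Muon's reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def normuon_step(W, grad, M, v, lr=0.02, beta1=0.95, beta2=0.95, eps=1e-8):
    """One illustrative NorMuon-style step on a 2-D weight matrix W.
    M is the momentum buffer; v holds one second-moment entry per row (neuron)."""
    M[:] = beta1 * M + grad               # momentum accumulation (assumed form)
    O = newton_schulz(M)                  # Muon-style orthogonalized update
    row_sq = np.mean(O ** 2, axis=1)      # neuron-wise second-moment statistic
    v[:] = beta2 * v + (1.0 - beta2) * row_sq
    O_bal = O / (np.sqrt(v)[:, None] + eps)   # row-wise normalization
    # Rescale so the overall update norm matches the pre-normalization one
    # (an assumption made here so the step size stays comparable to Muon's).
    O_bal *= np.linalg.norm(O) / (np.linalg.norm(O_bal) + eps)
    W -= lr * O_bal
    return W
```

After normalization, every row of the update has (nearly) the same norm, which is exactly the balanced-neuron behavior the abstract attributes to NorMuon; without it, the orthogonalized update's row norms can differ widely.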