Practical Efficiency of Muon for Pretraining

📅 2025-05-04

🤖 AI Summary

This work addresses the challenge of jointly optimizing data efficiency and computational efficiency in large-batch training. It combines the Muon second-order optimizer with the maximal update parameterization (muP). Muon retains data efficiency at batch sizes far beyond the critical batch size, and a telescoping hyperparameter transfer algorithm accounts for the sources of error in muP at modest additional cost. The approach is validated empirically on models of up to 4 billion parameters. Compared to AdamW, the reported training-cost reduction exceeds 30%, with downstream task performance maintained or improved. These gains advance the Pareto frontier of the compute-time trade-off in large-scale language model training.

📝 Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
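For readers unfamiliar with Muon, the following is a minimal NumPy sketch of one update step as the optimizer is commonly described: momentum is accumulated, and the resulting matrix is approximately orthogonalized with a Newton-Schulz iteration before being applied to the weights. The abstract does not specify these details, so everything here is an assumption for illustration: the function names `orthogonalize` and `muon_step`, the quintic coefficients (taken from the widely used public reference implementation), the plain (non-Nesterov) momentum, and the default `lr` and `beta` values. It is not the paper's training code and omits per-layer scaling and the muP rules.

```python
import numpy as np

# Quintic Newton-Schulz coefficients as used in the public Muon reference
# implementation (an assumption here; the abstract does not list them).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest semi-orthogonal matrix.

    Each iteration pushes the singular values of X toward 1 without
    computing an SVD; the result is approximate, not exact.
    """
    a, b, c = NS_COEFFS
    # Normalize so that all singular values are at most 1.
    X = G / (np.linalg.norm(G) + eps)
    # Work with the wide orientation so X @ X.T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X


def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2-D weight matrix."""
    momentum_buf = beta * momentum_buf + grad
    W = W - lr * orthogonalize(momentum_buf)
    return W, momentum_buf


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 16))
    grad = rng.standard_normal((8, 16))
    W, buf = muon_step(W, grad, np.zeros_like(W))
    print(np.linalg.svd(orthogonalize(grad), compute_uv=False))
```

Because the update direction has roughly uniform singular values, its scale does not depend on the raw gradient magnitude. This sketch covers 2-D weight matrices only; in practice other parameters (embeddings, biases, norms) are typically handled by a separate optimizer such as AdamW.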

Problem

Research questions and friction points this paper is trying to address.

Expands Pareto frontier over AdamW for compute-time tradeoff

Retains data efficiency at large batch sizes, beyond the critical batch size

Combines Muon with muP for efficient hyperparameter transfer

Innovation

Methods, ideas, or system contributions that make the work stand out.

Muon outperforms AdamW on compute-time tradeoff

Muon retains data efficiency at large batch sizes

Combines Muon with muP for hyperparameter transfer
