MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps in the Muon optimizer for large-scale models: the absence of a theoretical generalization guarantee, and the degradation of its generalization bound when the singular values of the gradient estimate are close together. Using algorithmic stability and mathematical induction, the study establishes a generalization error bound for Muon and introduces a hybrid optimizer, MiMuon. MiMuon applies gradient orthogonalization cautiously, combining Muon with momentum-based SGD to improve generalization while preserving the convergence rate. Theoretically, MiMuon reduces the generalization error from O(1/(Nκ^T)) to O(1/N), removing the dependence on κ, the minimum difference between singular values of the gradient estimate (generally very small), and matches Muon's convergence rate of O(1/T^{1/4}). Experiments on large models such as Qwen3-0.6B and YOLO26m support its effectiveness.
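To get a feel for how much the κ dependence matters, the short sketch below compares the two order-level bounds quoted in the abstract, O(1/(Nκ^T)) for Muon and O(1/N) for MiMuon. It is only an illustration: the hidden constants are assumed to be 1, and the function names and the values of N, T, and κ are hypothetical choices, not figures from the paper.

```python
def muon_bound(n_samples, kappa, n_iters):
    """Order-level Muon bound 1 / (N * kappa**T); hidden constant assumed to be 1."""
    return 1.0 / (n_samples * kappa ** n_iters)


def mimuon_bound(n_samples):
    """Order-level MiMuon bound 1 / N; hidden constant assumed to be 1."""
    return 1.0 / n_samples


# Hypothetical settings chosen only for illustration.
N, T = 1_000_000, 100

for kappa in (0.9, 0.5, 0.1):
    # The ratio equals kappa**(-T), so it is independent of N.
    ratio = muon_bound(N, kappa, T) / mimuon_bound(N)
    print(f"kappa={kappa}: Muon bound / MiMuon bound = {ratio:.3g}")
```

Because the ratio of the two bounds is κ^(-T), any κ below 1 makes the Muon bound grow exponentially in the number of iterations, while the MiMuon bound stays at 1/N.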

📝 Abstract

Matrix-structured parameters appear frequently in artificial intelligence models such as large language models. Recently, the efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) have not yet been established. In this paper, we therefore study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of $O\big(\frac{1}{N\kappa^{T}}\big)$, where $N$ is the training sample size, $T$ is the number of iterations, and $\kappa>0$ is the minimum difference between singular values of the gradient estimate. To improve the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer, a hybrid of the Muon and momentum-based SGD optimizers that uses gradient orthogonalization cautiously. We then prove that MiMuon has a generalization error of $O\big(\frac{1}{N}\big)$, which is lower than Muon's $O\big(\frac{1}{N\kappa^{T}}\big)$ because $\kappa$ is generally very small. We also study the convergence properties of MiMuon and prove that it has the same convergence rate of $O\big(\frac{1}{T^{1/4}}\big)$ as Muon. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
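The abstract does not spell out how MiMuon decides when to orthogonalize, so the following is only a rough sketch of what "cautiously using orthogonalization" could look like, not the paper's algorithm. It assumes a hypothetical rule: orthogonalize the momentum matrix when its smallest singular-value gap is at least `gap_threshold`, and otherwise fall back to a plain momentum-SGD direction. The names `mixed_step`, `min_singular_gap`, `gap_threshold`, the exponential-moving-average momentum, and all default values are illustrative assumptions. An exact SVD is used for readability; practical Muon implementations typically approximate the orthogonalization with a Newton-Schulz iteration instead.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ Vt from the thin SVD of m (all singular values replaced by 1)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def min_singular_gap(m):
    """Smallest difference between consecutive singular values of m."""
    s = np.linalg.svd(m, compute_uv=False)  # returned in descending order
    if s.size < 2:
        return float("inf")
    return float(np.min(-np.diff(s)))


def mixed_step(w, grad, momentum, lr=0.02, beta=0.9, gap_threshold=1e-3):
    """One hypothetical mixed update; the switching rule is an assumption."""
    momentum = beta * momentum + (1.0 - beta) * grad
    if min_singular_gap(momentum) >= gap_threshold:
        direction = orthogonalize(momentum)  # Muon-style direction
        used_orth = True
    else:
        direction = momentum                 # momentum-SGD fallback
        used_orth = False
    return w - lr * direction, momentum, used_orth


# Usage: a gradient with well-separated singular values takes the Muon-style
# branch, while one with identical singular values takes the fallback branch.
w0 = np.zeros((2, 2))
m0 = np.zeros((2, 2))
_, _, orth_a = mixed_step(w0, np.diag([3.0, 1.0]), m0)
_, _, orth_b = mixed_step(w0, np.eye(2), m0)
print(orth_a, orth_b)
```

The threshold test is one simple way to avoid orthogonalizing when singular values nearly coincide, which is the regime where the Muon bound in the abstract degrades; the actual criterion used by MiMuon may differ.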

Problem

Research questions and friction points this paper is trying to address.

generalization error

Muon optimizer

matrix-structured parameters

large models

algorithmic stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

MiMuon

generalization error

algorithmic stability