HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

๐Ÿ“… 2026-03-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses a critical limitation of existing Muon optimizers in large language model training: their tendency to suppress the heavy-tailed nature of weight spectra, which biases updates toward noise-dominated directions. Drawing on heavy-tailed self-regularization theory, the authors propose HTMuon, the first heavy-tailed spectral correction within the Muon framework, formulated as steepest descent under a Schatten-q norm constraint. This approach preserves Muon's ability to model parameter interdependencies while yielding heavier-tailed updates and weight spectra. The method offers both theoretical convergence guarantees and plug-and-play compatibility with existing Muon variants. Empirical evaluations demonstrate consistent improvements over state-of-the-art baselines, achieving up to a 0.98 reduction in perplexity on the C4 dataset during LLaMA pretraining, as well as superior performance in image classification tasks.

๐Ÿ“ Abstract
Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
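The abstract states that HTMuon corresponds to steepest descent under a Schatten-$q$ norm constraint. The paper's exact update rule and hyperparameters are not reproduced here, but the closed form of that constrained steepest-descent direction is standard: for a gradient $G = U\,\mathrm{diag}(s)\,V^\top$, Hölder's equality condition gives the maximizer of $\langle G, X\rangle$ over $\|X\|_{S_q} \le 1$ as $U\,\mathrm{diag}(s^{p-1})\,V^\top / \|s\|_p^{p-1}$, where $p = q/(q-1)$ is the dual exponent. As $q \to \infty$ this recovers Muon's orthogonalized update $UV^\top$; finite $q$ reshapes rather than flattens the spectrum, which is the spirit of the heavier-tailed correction. A minimal sketch (function name and the choice `q=4.0` are illustrative assumptions, not the paper's code):

```python
import numpy as np

def schatten_q_direction(G, q=4.0):
    """Hypothetical sketch: steepest-descent direction for gradient G
    under a Schatten-q norm constraint (q > 1).

    Returns X = U diag(s^(p-1)) V^T / ||s||_p^(p-1), where
    G = U diag(s) V^T and p = q/(q-1) is the Hoelder dual exponent.
    By construction ||X||_{S_q} = 1 and <G, X> = ||G||_{S_p};
    q -> infinity reduces to Muon's orthogonalized update U V^T.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    p = q / (q - 1.0)                       # dual exponent: 1/p + 1/q = 1
    w = s ** (p - 1.0)                      # reshape the singular spectrum
    w /= np.sum(s ** p) ** ((p - 1.0) / p)  # divide by ||s||_p^(p-1)
    return (U * w) @ Vt                     # U diag(w) V^T

# Illustration: the direction has unit Schatten-q norm, and a very
# large q approximately recovers the orthogonalized (Muon-style) update.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
X = schatten_q_direction(G, q=4.0)
sv = np.linalg.svd(X, compute_uv=False)
print(np.sum(sv ** 4.0) ** 0.25)            # Schatten-4 norm, ~1.0
```

In an optimizer loop this direction would replace Muon's Newton–Schulz orthogonalization of the momentum matrix; the actual HTMuon implementation is in the repository linked above.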
Problem

Research questions and friction points this paper is trying to address.

Muon
heavy-tailed
weight spectra
LLM training
orthogonalized update
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heavy-Tailed Spectral Correction
Muon optimizer
Schatten-q norm
Self-Regularization
LLM optimization
๐Ÿ”Ž Similar Papers
No similar papers found.