AI Summary
This work addresses the sensitivity of Muon-like optimizers to step size and their vulnerability to high-energy burst updates, which stem from the loss of magnitude information during orthogonalization. Building upon the near-isometric geometric properties of these optimizers, the authors propose three key improvements: an efficient orthogonalization via Newton-Schulz iteration, a global RMS-based calibration mechanism for adaptive scaling of update magnitudes, and a trust-region strategy grounded in relative energy ratios to suppress anomalous updates. The resulting optimizer eliminates the need for learning rate warmup and demonstrates significantly enhanced training stability, robustness, and faster convergence across both vision and language models.
Abstract
Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (**T**rust **R**egion **A**daptive **S**caling **Muon**). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We show that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Experiments on vision and language models demonstrate that TrasMuon converges faster than baselines, and runs without warmup stages confirm its superior stability and robustness.
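To make the three ingredients concrete, here is a minimal NumPy sketch of one TrasMuon-style step: orthogonalize the gradient with Newton-Schulz iterations, rescale the result to a target RMS, then clip it against an energy-based trust region. This is an illustrative reading of the abstract, not the paper's implementation: the cubic NS variant, the `target_rms`, `trust_ratio`, and EMA constants, and the function names are all assumptions (Muon itself uses a tuned quintic NS polynomial).

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via cubic Newton-Schulz iterations.
    (A simplified stand-in: Muon uses a tuned quintic polynomial.)"""
    X = G / (np.linalg.norm(G) + eps)  # normalize so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # drives singular values toward 1
    return X

def trasmuon_update(G, energy_ema, target_rms=0.2, trust_ratio=2.0,
                    ema_beta=0.99):
    """Hypothetical single step: orthogonalize, RMS-calibrate, then apply an
    energy-ratio trust region. Constants here are illustrative, not the paper's."""
    O = newton_schulz_orthogonalize(G)
    # (i) global RMS calibration: rescale the update to a target RMS
    rms = np.sqrt(np.mean(O ** 2))
    U = O * (target_rms / (rms + 1e-12))
    # (ii) trust region on the relative energy ratio: if this step's energy
    # exceeds trust_ratio times the running average, shrink it back inside
    energy = float(np.sum(U ** 2))
    if energy_ema > 0 and energy / energy_ema > trust_ratio:
        U *= np.sqrt(trust_ratio * energy_ema / energy)  # clamp burst updates
        energy = trust_ratio * energy_ema
    energy_ema = ema_beta * energy_ema + (1 - ema_beta) * energy
    return U, energy_ema
```

In this sketch the RMS calibration restores the magnitude information that orthogonalization discards, while the trust region suppresses the high-energy outliers that adaptive scaling would otherwise amplify.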