TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

📅 2026-02-13

🤖 AI Summary
This work addresses the sensitivity of Muon-like optimizers to step size and their vulnerability to high-energy burst updates, both of which stem from the loss of magnitude information during orthogonalization. Muon-style optimizers already orthogonalize updates with Newton-Schulz iterations; TrasMuon keeps that near-isometric geometry and adds two components: a global RMS-based calibration that adaptively rescales update magnitudes, and a trust-region strategy based on relative energy ratios that suppresses anomalous updates. In experiments on vision and language models, the optimizer converges faster than the baselines, and runs without learning-rate warmup indicate improved stability and robustness.


๐Ÿ“ Abstract
Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (\textbf{T}rust \textbf{R}egion \textbf{A}daptive \textbf{S}caling \textbf{Muon}). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We demonstrate that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Empirical experiments on vision and language models demonstrate that TrasMuon converges faster than baselines. Furthermore, experiments without warmup stages confirm TrasMuon's superior stability and robustness.
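
The two stabilizing components named in the abstract can be illustrated with a small, dependency-free Python sketch. This is not the paper's algorithm: the abstract gives no formulas, so the function names (`calibrate_rms`, `clip_by_energy_ratio`), the per-row definition of energy, the median reference, and the values `target_rms=0.2` and `max_ratio=4.0` are all assumptions chosen for illustration.

```python
import math
import statistics


def rms(update):
    """Root-mean-square over all entries of a matrix stored as a list of rows."""
    count = sum(len(row) for row in update)
    return math.sqrt(sum(v * v for row in update for v in row) / count)


def calibrate_rms(update, target_rms=0.2, eps=1e-12):
    """Rescale the whole update so its global RMS matches target_rms (assumed value)."""
    scale = target_rms / (rms(update) + eps)
    return [[v * scale for v in row] for row in update]


def row_energies(update):
    """Energy of each row, taken here as its mean squared entry (an assumption)."""
    return [sum(v * v for v in row) / len(row) for row in update]


def clip_by_energy_ratio(update, max_ratio=4.0, eps=1e-12):
    """Shrink rows whose energy exceeds max_ratio times the median row energy."""
    energies = row_energies(update)
    reference = statistics.median(energies) + eps
    clipped = []
    for row, energy in zip(update, energies):
        ratio = energy / reference
        # Rows inside the trust region are left untouched (shrink factor 1.0).
        shrink = math.sqrt(max_ratio / ratio) if ratio > max_ratio else 1.0
        clipped.append([v * shrink for v in row])
    return clipped


# Toy update: three ordinary rows and one high-energy burst row.
update = [
    [0.1, -0.1, 0.1],
    [0.1, 0.1, -0.1],
    [-0.1, 0.1, 0.1],
    [5.0, -5.0, 5.0],
]
calibrated = calibrate_rms(update)
stable = clip_by_energy_ratio(calibrated)
print("row energies before clipping:", row_energies(calibrated))
print("row energies after clipping: ", row_energies(stable))
```

In this toy example the burst row is scaled down until its energy sits at the assumed ratio limit, while the three ordinary rows pass through unchanged. Using a relative ratio rather than an absolute threshold keeps the rule independent of the overall step size, which matches the abstract's description of a trust region "based on relative energy ratios".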
Problem

Research questions and friction points this paper is trying to address.

orthogonalization
magnitude information
step-size sensitivity
high-energy bursts
training instability
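
The link between orthogonalization and lost magnitude information can be checked numerically. The sketch below uses the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·X·Xᵀ·X, applied after Frobenius normalization; this is a textbook variant chosen for clarity, not necessarily the tuned polynomial that Muon implementations use, and the helper names and the step count of 30 are assumptions. Plain Python lists keep the example free of dependencies.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(v * v for row in a for v in row))


def newton_schulz(g, steps=30, eps=1e-12):
    """Approximate the orthogonal polar factor of g with the cubic iteration."""
    norm = frobenius_norm(g) + eps
    # Normalizing first puts every singular value in (0, 1], where the
    # iteration converges; it is also where the input's scale is thrown away.
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * xv - 0.5 * yv for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, xxt_x)
        ]
    return x


# The same gradient direction at two very different magnitudes.
g = [[3.0, 0.0, 4.0], [1.0, 2.0, 0.0]]
g_burst = [[100.0 * v for v in row] for row in g]

q_small = newton_schulz(g)
q_large = newton_schulz(g_burst)

print("input norms: ", frobenius_norm(g), frobenius_norm(g_burst))
print("output norms:", frobenius_norm(q_small), frobenius_norm(q_large))
```

Both outputs have orthonormal rows, so their Frobenius norms are each close to √2 even though the inputs differ by a factor of 100. The orthogonalized update therefore carries no record of how large the underlying gradient was, which leaves the step size entirely to the learning rate.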
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust-Region
Adaptive Scaling
Orthogonalized Momentum
Newton-Schulz Iteration
Optimizer Stability