TrasMuon: Trust-Region Adaptive Scaling for Orthogonalized Momentum Optimizers

📅 2026-02-13
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the step-size sensitivity of Muon-like optimizers and their vulnerability to high-energy burst updates, both of which stem from the loss of magnitude information during orthogonalization. Building on the near-isometric geometry of these optimizers, the authors combine three components: efficient orthogonalization via Newton–Schulz iteration, a global RMS-based calibration mechanism that adaptively rescales update magnitudes, and a trust-region strategy grounded in relative energy ratios that suppresses anomalous updates. The resulting optimizer eliminates the need for learning-rate warmup and demonstrates markedly improved training stability and robustness, along with faster convergence, on both vision and language models.
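For concreteness, below is a minimal PyTorch sketch of the quintic Newton–Schulz iteration that Muon-style optimizers use to orthogonalize a momentum matrix. The coefficients and iteration count follow the widely circulated open-source Muon implementation; they are assumptions here, and the paper's exact variant may differ.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix G via Newton-Schulz iteration.

    Coefficients are the quintic ones from the public Muon implementation
    (an assumption, not necessarily this paper's exact choice).
    """
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.bfloat16)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.mT
    # Scale so the spectral norm is <= 1, which the iteration needs to converge.
    X = X / (X.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # quintic update: aX + bAX + cA^2 X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```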

📝 Abstract
Muon-style optimizers leverage Newton–Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (Trust-Region Adaptive Scaling Muon). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We show that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Experiments on vision and language models demonstrate that TrasMuon converges faster than baselines, and runs without warmup stages confirm its superior stability and robustness.
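To make the two magnitude-control mechanisms concrete, here is a hypothetical sketch of a single parameter update that combines (i) global RMS calibration and (ii) an energy-ratio trust region, reusing the newton_schulz_orthogonalize sketch above. The function name, the target RMS, the trust-region threshold, and the exact form of the energy ratio are illustrative assumptions, not the paper's definitions.

```python
import torch

def trasmuon_step(param: torch.Tensor, momentum: torch.Tensor, lr: float,
                  target_rms: float = 0.2, max_energy_ratio: float = 2.0) -> None:
    """Illustrative TrasMuon-style update (assumed form, not the paper's exact algorithm)."""
    # Orthogonalize the momentum as in Muon; this discards magnitude information.
    O = newton_schulz_orthogonalize(momentum)
    # (i) Global RMS calibration: rescale the orthogonalized update so its
    # RMS matches a target, restoring a usable magnitude scale.
    rms = O.square().mean().sqrt()
    update = O * (target_rms / (rms + 1e-8))
    # (ii) Energy-based trust region: if the update's energy relative to the
    # raw momentum's energy leaves the trusted zone, shrink the update back.
    ratio = update.norm() / (momentum.norm() + 1e-8)
    if ratio > max_energy_ratio:
        update = update * (max_energy_ratio / ratio)
    param.add_(update, alpha=-lr)
```

In this sketch the clipping is multiplicative, so an anomalous high-energy burst is scaled down rather than dropped, which is one plausible way to confine updates to a stable zone without stalling progress.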
Problem

Research questions and friction points this paper is trying to address.

orthogonalization
magnitude information
step-size sensitivity
high-energy bursts
training instability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Trust-Region
Adaptive Scaling
Orthogonalized Momentum
Newton-Schulz Iteration
Optimizer Stability
Peng Cheng
Huawei Canadian Research Institute, Canada
Jiucheng Zang
Department of Combinatorics and Optimization, University of Waterloo, Waterloo, Canada.
Qingnan Li
Huawei Canadian Research Institute, Canada
Liheng Ma
PhD student, McGill University & Mila.
Geometric Deep Learning, Graph Neural Networks, Time Series, Machine Learning
Yufei Cui
McGill University, MILA
Medical AI, RAG, LLM Agent, Predictive Uncertainty
Yingxue Zhang
Huawei
Graph representation learning, Graph Reasoning, LLMs Reasoning, Knowledge Graphs, Recommender Systems
Boxing Chen
Huawei Technologies Canada
Natural Language Processing, Artificial Intelligence
Ming Jian
Huawei Canadian Research Institute, Canada
Wen Tong
Huawei Canadian Research Institute, Canada