AI Summary
This work addresses the sensitivity of Muon-like optimizers to step size and their vulnerability to high-energy burst updates, which stem from the loss of magnitude information during orthogonalization. Building upon the near-isometric geometric properties of these optimizers, the authors propose three key improvements: an efficient orthogonalization via Newton-Schulz iteration, a global RMS-based calibration mechanism for adaptive scaling of update magnitudes, and a trust-region strategy grounded in relative energy ratios to suppress anomalous updates. The resulting optimizer eliminates the need for learning rate warmup and demonstrates significantly enhanced training stability, robustness, and faster convergence across both vision and language models.
Abstract
Muon-style optimizers leverage Newton-Schulz (NS) iterations to orthogonalize updates, yielding update geometries that often outperform Adam-series methods. However, this orthogonalization discards magnitude information, rendering training sensitive to step-size hyperparameters and vulnerable to high-energy bursts. To mitigate this, we introduce TrasMuon (**T**rust **R**egion **A**daptive **S**caling **Muon**). TrasMuon preserves the near-isometric geometry of Muon while stabilizing magnitudes through (i) global RMS calibration and (ii) energy-based trust-region clipping. We show that while reintroducing adaptive scaling improves optimization efficiency, it typically exacerbates instability due to high-energy outliers. TrasMuon addresses this by defining a trust region based on relative energy ratios, confining updates to a stable zone. Experiments on vision and language models demonstrate that TrasMuon converges faster than baselines, and runs without warmup stages confirm its superior stability and robustness.
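To make the three ingredients concrete, here is a minimal NumPy sketch of one TrasMuon-style step: orthogonalize the gradient with Newton-Schulz iterations, rescale the result to a target RMS, then clip it against an energy-based trust region. This is an illustrative reading of the abstract, not the paper's implementation: the cubic NS variant, the `target_rms`, `trust_ratio`, and EMA constants, and the function names are all assumptions (Muon itself uses a tuned quintic NS polynomial).

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G via cubic Newton-Schulz iterations.
    (A simplified stand-in: Muon uses a tuned quintic polynomial.)"""
    X = G / (np.linalg.norm(G) + eps)  # normalize so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # drives singular values toward 1
    return X

def trasmuon_update(G, energy_ema, target_rms=0.2, trust_ratio=2.0,
                    ema_beta=0.99):
    """Hypothetical single step: orthogonalize, RMS-calibrate, then apply an
    energy-ratio trust region. Constants here are illustrative, not the paper's."""
    O = newton_schulz_orthogonalize(G)
    # (i) global RMS calibration: rescale the update to a target RMS
    rms = np.sqrt(np.mean(O ** 2))
    U = O * (target_rms / (rms + 1e-12))
    # (ii) trust region on the relative energy ratio: if this step's energy
    # exceeds trust_ratio times the running average, shrink it back inside
    energy = float(np.sum(U ** 2))
    if energy_ema > 0 and energy / energy_ema > trust_ratio:
        U *= np.sqrt(trust_ratio * energy_ema / energy)  # clamp burst updates
        energy = trust_ratio * energy_ema
    energy_ema = ema_beta * energy_ema + (1 - ema_beta) * energy
    return U, energy_ema
```

In this sketch the RMS calibration restores the magnitude information that orthogonalization discards, while the trust region suppresses the high-energy outliers that adaptive scaling would otherwise amplify.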