AI Summary
This work addresses the sensitivity of Muon and related normalized optimizers to the scale of the normalized step, which strongly affects their practical performance. The authors propose three adaptive scaling algorithms that set either the trust-region radius or the step length automatically, covering smooth non-convex and star-convex objectives and reducing reliance on manual hyperparameter tuning. The analysis is carried out in general norm geometries: Distance-Adaptive Muon comes with a stationarity guarantee under a bounded-trajectory assumption, and Scale-Calibrated Muon comes with a last-iterate O(1/T) objective-gap bound under star-convexity. The methods draw on the radius explored by the trajectory, local descent certificates, scalar distance certificates, and a majorized one-dimensional search. Experiments on GPT-124M/WikiText-103 and ViT-Tiny/CIFAR-100 show reduced sensitivity to manual scale tuning, with performance matching or improving on tuned fixed-scale Muon baselines under the tested budgets.
Abstract
Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon's exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.
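To make the idea of a trajectory-driven trust-region radius concrete, here is a minimal NumPy sketch. It is not the authors' algorithm: the abstract does not state the radius rule, so the rule used here (largest spectral-norm distance travelled from the initialization, floored at a small `r_eps` and damped by `1/sqrt(t + 1)`) is an illustrative assumption. The function names, the parameters `beta` and `r_eps`, and the use of an exact SVD for the orthogonalized direction are likewise assumptions made for the example.

```python
import numpy as np


def orthogonalize(m):
    """Polar factor U @ V^T of m, a unit-spectral-norm direction aligned with m."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def distance_adaptive_muon(grad_fn, x0, steps=200, beta=0.9, r_eps=1e-3):
    """Illustrative sketch only: normalized momentum steps with a trajectory-based radius.

    ASSUMPTION: the radius rule below is a stand-in chosen for this example,
    not the rule from the paper.
    """
    x = x0.copy()
    momentum = np.zeros_like(x0)
    max_dist = 0.0  # largest spectral-norm distance from x0 seen so far
    for t in range(steps):
        g = grad_fn(x)
        # Exponential moving average of gradients.
        momentum = beta * momentum + (1.0 - beta) * g
        # The direction depends only on the singular vectors of the momentum.
        direction = orthogonalize(momentum)
        # Hypothetical radius rule: explored distance, floored and damped.
        radius = max(r_eps, max_dist) / np.sqrt(t + 1)
        x = x - radius * direction
        max_dist = max(max_dist, np.linalg.norm(x - x0, ord=2))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal((4, 3))
    # Toy quadratic 0.5 * ||X - target||_F^2, whose gradient is X - target.
    x_final = distance_adaptive_muon(lambda x: x - target, np.zeros((4, 3)))
    print("final Frobenius distance to target:", np.linalg.norm(x_final - target))
```

The sketch separates the two choices the abstract describes: `orthogonalize` fixes the update direction, while the scalar `radius` alone sets the step scale, so the only scale-related input is the small floor `r_eps` rather than a hand-tuned learning rate.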