OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

📅 2026-05-08

🤖 AI Summary
This work addresses the lack of layer-adaptive update magnitudes in the Muon optimizer by proposing OrScale and its language-model variant, OrScale-LM. Both apply a layer-wise trust ratio whose denominator is the Frobenius norm of the parameter-space direction that is actually applied, and use that ratio to scale orthogonalized matrix updates. This avoids the three failure modes of natural Muon-LAMB hybrids (shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway) while preserving muP-style learning-rate transfer at initialization. OrScale combines orthogonalized updates with coupled weight decay; OrScale-LM adds Moonlight shape scaling and one-time per-layer calibration so that every trust ratio starts at one. On CIFAR-10 with DavidNet, OrScale reaches 94.05% validation top-1 accuracy across three seeds, up from 93.70% for Muon. In FineWeb-Edu pre-training, OrScale-LM outperforms Muon+Moonlight at three of four model scales between 125M and 1.1B parameters and outperforms AdamW at every scale.
📝 Abstract
Muon improves neural-network training by orthogonalizing matrix-valued updates, but it leaves each layer's update magnitude controlled mostly by a global learning rate. We introduce OrScale, a trust-ratio extension of Muon built on a simple rule: the denominator of a layer-wise ratio should measure the Frobenius norm of the actual parameter-space direction that will be applied. This yields OrScale for general matrix layers and OrScale-LM for language models, where Moonlight shape scaling is combined with one-time per-layer calibration so every trust ratio starts at one. We analyze why three natural Muon-LAMB hybrids fail through shape-degenerate denominators, raw-momentum clip saturation, and decoupled weight-decay runaway, and show that the real-update-direction denominator with coupled weight decay avoids these failures. Theoretically, OrScale admits an O(1/sqrt(T)) nonconvex convergence guarantee in a nuclear-norm criterion, a strict layer-adaptive descent gain under measurable layer heterogeneity, and calibration properties that preserve muP-style learning-rate transfer at initialization. Empirically, OrScale ranks first on CIFAR-10/DavidNet across three seeds, improving Muon from 93.70% to 94.05% validation top-1, and OrScale-LM improves FineWeb-Edu pre-training versus Muon+Moonlight at three of four scales from 125M to 1.1B parameters while outperforming AdamW at every scale.
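The rule stated in the abstract (the trust-ratio denominator measures the Frobenius norm of the direction that is actually applied, with weight decay coupled into that direction) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so several choices here are assumptions rather than the paper's method: the LAMB-style numerator (the Frobenius norm of the weights), the exact SVD polar factor standing in for the Newton-Schulz iteration that Muon normally uses, the calibration as a single multiplicative constant, and all function names and hyperparameter defaults. Moonlight shape scaling is omitted.

```python
import numpy as np

def polar_factor(m):
    """Exact orthogonal polar factor U @ Vt of m (stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def raw_trust_ratio(w, direction, eps=1e-12):
    """Assumed LAMB-style ratio: ||W||_F over the norm of the applied direction."""
    return np.linalg.norm(w) / (np.linalg.norm(direction) + eps)

def orscale_like_step(w, momentum, grad, lr=0.02, beta=0.95, wd=0.01, calib=1.0):
    """One hypothetical layer update; returns (new_w, new_momentum, trust_ratio)."""
    momentum = beta * momentum + grad
    # Coupled weight decay: the decay term is part of the direction the ratio measures.
    direction = polar_factor(momentum) + wd * w
    ratio = calib * raw_trust_ratio(w, direction)
    return w - lr * ratio * direction, momentum, ratio

def calibrate(w0, grad0, wd=0.01):
    """One-time constant chosen so the layer's trust ratio is 1 on the first step."""
    direction0 = polar_factor(grad0) + wd * w0
    return 1.0 / raw_trust_ratio(w0, direction0)

# Usage on random data (illustrative only, not a reproduction of any experiment).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)) * 0.02
g = rng.normal(size=(64, 32))
c = calibrate(w, g)
w1, m1, r = orscale_like_step(w, np.zeros_like(w), g, calib=c)
print(r)  # equals 1 up to floating-point rounding, by construction of c
```

Because the denominator is the norm of the same matrix that multiplies the step size, the resulting update norm in this sketch is approximately `lr * calib * ||W||_F` regardless of the layer's shape, which is one way to read the abstract's "shape-degenerate denominators" concern about the alternatives.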
Problem

Research questions and friction points this paper is trying to address.

layer-wise scaling
trust ratio
orthogonal optimization
weight decay
learning rate adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

OrScale
trust-ratio scaling
orthogonalized optimization
layer-wise adaptation
coupled weight decay