🤖 AI Summary
This work addresses a geometric mismatch in mainstream optimizers such as Adam, whose coordinate-wise updates disregard the symmetry and equivariance structures of neural network parameter spaces. The authors propose a principled framework for designing symmetry-compatible optimizers: each gradient update rule is made equivariant under the symmetry group acting on the weight block it updates. They apply this principle to specific architectural components (embedding layers, language model heads, SwiGLU projections, and MoE routers) and assemble the resulting updates into an end-to-end layerwise optimizer stack. The approach extends beyond generic matrix layers to permutation and shared-shift symmetries, unifying and extending existing equivariant methods such as stochastic spectral descent, Muon, Scion, and polar gradient methods. The resulting constructions include one-sided spectral, row-norm, row-aware, column-aware, and centered row-norm updates. In pre-training both dense and sparse MoE language models, the symmetry-compatible updates consistently achieve lower final validation loss than the corresponding AdamW updates and, in several cases, better training stability.
📝 Abstract
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
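The equivariance principle stated in the abstract can be checked numerically on a toy gradient. The sketch below is illustrative only and is not taken from the paper: `sign_update` is an assumed stand-in for a coordinate-wise Adam-style step (Adam's first bias-corrected step reduces to the sign of the gradient as its epsilon goes to zero), and `row_norm_update` is one simple assumed form of a row-norm update. All function and variable names are hypothetical. An update `U` is equivariant under a transformation `T` when `U(T(G))` equals `T(U(G))`, so each "gap" below measures how far an update is from that property.

```python
import math

def matmul(A, B):
    # Plain nested-list matrix product A @ B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sign_update(G):
    # Coordinate-wise sign step: an assumed stand-in for Adam-style updates.
    return [[0.0 if g == 0 else math.copysign(1.0, g) for g in row] for row in G]

def row_norm_update(G, eps=1e-12):
    # Rescale every row of the gradient to (approximately) unit Euclidean norm.
    out = []
    for row in G:
        norm = math.sqrt(sum(g * g for g in row))
        out.append([g / (norm + eps) for g in row])
    return out

def permute_rows(G, perm):
    # Reorder rows: row i of the result is row perm[i] of G.
    return [G[i] for i in perm]

def max_abs_diff(A, B):
    # Largest entrywise difference between two equally shaped matrices.
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy 3x2 "gradient", a row permutation, and a 45-degree rotation of the columns.
G = [[3.0, 4.0], [1.0, 0.0], [0.0, -2.0]]
perm = [2, 0, 1]
c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
Q = [[c, -s], [s, c]]

# Equivariance gap: update(transform(G)) versus transform(update(G)).
perm_gap_row_norm = max_abs_diff(
    row_norm_update(permute_rows(G, perm)), permute_rows(row_norm_update(G), perm)
)
rot_gap_row_norm = max_abs_diff(
    row_norm_update(matmul(G, Q)), matmul(row_norm_update(G), Q)
)
rot_gap_sign = max_abs_diff(sign_update(matmul(G, Q)), matmul(sign_update(G), Q))

print("row-norm, row permutation gap:", perm_gap_row_norm)
print("row-norm, right rotation gap: ", rot_gap_row_norm)
print("sign,     right rotation gap: ", rot_gap_sign)
```

In this toy setting the row-norm update commutes with both the row permutation and the right rotation up to floating-point error, while the coordinate-wise sign step does not commute with the rotation. Which update the authors assign to which parameter block is specified in the paper itself; this sketch only shows what "equivariant" means operationally.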