Controlled LLM Training on Spectral Sphere

📅 2026-01-13
📈 Citations: 5
Influential: 0
🤖 AI Summary
Existing large-model optimizers struggle to simultaneously stabilize both weights and their updates, often leading to issues such as activation blowup, slow convergence, and imbalanced expert utilization in Mixture-of-Experts (MoE) architectures. This work proposes the Spectral Sphere Optimizer (SSO), which introduces, for the first time, module-wise joint spectral constraints on both weights and their updates, rigorously aligning with Maximal Update Parametrization (μP) to ensure optimization stability. SSO is derived from the steepest descent direction on the spectral sphere and integrates seamlessly into the Megatron framework, supporting Dense, MoE, and DeepNet architectures. Experiments demonstrate that SSO consistently outperforms AdamW and Muon across a 1.7B Dense model, an 8B-A1B MoE model, and a 200-layer DeepNet, effectively suppressing anomalous activations, improving routing balance, and enhancing training stability and scalability.

📝 Abstract
Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbol{\mu}$P) provides a theoretical safeguard for width-invariant $\Theta(1)$ activation control, whereas emerging optimizers like Muon are only ``half-aligned'' with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbol{\mu}$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
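To make the abstract's core idea concrete, here is a minimal NumPy sketch of a spectral-sphere-style update: the gradient is orthogonalized via SVD (the Muon-style steepest-descent direction under a spectral-norm geometry), and the weight is then retracted onto the sphere of fixed spectral norm by rescaling with its top singular value. The function names, the retraction-by-rescaling step, and all hyperparameters are illustrative assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def spectral_msign(G):
    # Orthogonalize the gradient: U @ V^T from its (thin) SVD.
    # This is the steepest-descent direction under a spectral-norm
    # constraint on the update (the Muon-style "msign" direction).
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def sso_step(W, G, lr=0.02, radius=1.0):
    # Illustrative spectral-sphere step (assumed form, not the paper's):
    # 1) move along the orthogonalized gradient direction;
    # 2) retract the weight back onto the spectral sphere
    #    ||W||_2 = radius by dividing out its largest singular value.
    W_new = W - lr * spectral_msign(G)
    sigma_max = np.linalg.norm(W_new, ord=2)
    return W_new * (radius / sigma_max)
```

After each step the weight's spectral norm is exactly `radius`, so both the update (via orthogonalization) and the weight (via retraction) stay spectrally controlled, which is the "fully μP-aligned" property the abstract contrasts with Muon's update-only control.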
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Optimization
Maximal Update Parametrization
Training Stability
Spectral Constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spectral Sphere Optimizer
Maximal Update Parametrization
spectral constraints
large language model training
optimizer alignment