🤖 AI Summary
Existing LMO-based optimizers (e.g., Muon, Scion) suffer from a theory-practice gap in large language model (LLM) training: their convergence analyses ignore the layer-wise structure of practical implementations, in which the LMO is invoked independently for each layer, and rely on unrealistic global smoothness assumptions, yielding overly conservative theoretical step sizes that are incompatible with empirical LLM training.
Method: The authors propose Gluon, a unified framework that establishes the first convergence theory aligned with practical layer-wise LMO usage. It introduces a generalized local smoothness assumption that captures per-layer geometric properties of neural networks, thereby unifying Muon and Scion and exposing their implicit geometric adaptivity.
Results: The theoretically derived step sizes closely match empirically tuned values across benchmarks (NanoGPT, CNN), and the new smoothness assumption remains valid along the optimization trajectory. Gluon bridges the long-standing gap between theoretical guarantees and empirical performance for LMO-based optimizers.
📝 Abstract
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as $\sf Muon$ and $\sf Scion$. After over a decade of $\sf Adam$'s dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called $\sf Gluon$, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of $\sf Muon$ and $\sf Scion$, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported by Pethick et al. (2025). Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
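To make the "layer-wise LMO application" concrete: for matrix parameters, Muon-style methods are commonly understood to solve a linear minimization over a spectral-norm ball, whose closed-form solution orthogonalizes the gradient via its SVD, and each layer receives its own oracle call with its own radius. The sketch below is illustrative only and is not taken from the paper; the function names, the choice of the spectral-norm ball, and the per-layer radii are assumptions for exposition.

```python
import numpy as np

def lmo_spectral(grad, radius):
    # LMO over the spectral-norm ball of the given radius:
    #   argmin_{||X||_2 <= radius} <grad, X> = -radius * U @ V^T,
    # where grad = U S V^T is a reduced SVD. This orthogonalizes
    # the gradient, keeping only its singular directions.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return -radius * U @ Vt

def layerwise_lmo_step(params, grads, radii):
    # One layer-wise step: the oracle is invoked independently per
    # layer, each with its own radius (acting as a per-layer step
    # size), rather than once over the concatenated parameters.
    return [W + lmo_spectral(g, r) for W, g, r in zip(params, grads, radii)]
```

The key structural point is that each layer's update direction and magnitude are decoupled, which is the layer-wise geometry the generalized smoothness assumption is designed to model.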