LionMuon: Alternating Spectral and Sign Descent for Efficient Training

📅 2026-05-19
🤖 AI Summary
This work addresses a key challenge in large-scale optimization: obtaining effective update directions at a low per-iteration computational cost. We propose LionMuon, an optimizer that alternates on a fixed period between Lion’s sign-based updates and Muon’s spectral matrix-sign updates, with both sharing a single dual exponential moving average (EMA) momentum buffer. The design retains the effectiveness of Muon steps at an average per-iteration cost close to that of sign-based methods, and its optimizer state takes half the memory of AdamW's. In experiments on models from 124M to 720M parameters, LionMuon reaches lower validation loss at lower compute than Muon, Lion, Signum, and AdamW on every dataset and architecture tested at the 124M scale, and the advantage persists at 355M and 720M.
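The memory claim can be checked with simple arithmetic. The sketch below assumes optimizer state is stored in 32-bit floats (4 bytes per element) and counts only per-parameter state buffers: AdamW keeps two (first and second moments), while a single shared momentum buffer means one. The function name `optimizer_state_bytes` is hypothetical and used only for illustration.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_element=4):
    """Bytes of optimizer state, assuming one element per parameter per buffer."""
    return num_params * buffers_per_param * bytes_per_element


num_params = 124_000_000  # the smallest model size mentioned in the summary

# AdamW stores first- and second-moment estimates: two buffers.
adamw_bytes = optimizer_state_bytes(num_params, buffers_per_param=2)

# A single shared momentum buffer, as described for LionMuon (and Lion): one buffer.
lionmuon_bytes = optimizer_state_bytes(num_params, buffers_per_param=1)

print(f"AdamW state:    {adamw_bytes / 1e6:.0f} MB")
print(f"LionMuon state: {lionmuon_bytes / 1e6:.0f} MB")
```

Under a different storage precision the absolute numbers change, but the 2:1 ratio between the two optimizers stays the same.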
📝 Abstract
In large-scale optimization, the cost and effectiveness of each update step are the most crucial factors for a successful optimizer. Sign-based optimizers such as Lion or Signum produce cheap per-step updates, whereas Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the average per-iteration cost, bringing it close to that of sign-based methods. It alternates between Lion's and Muon's updates with a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer-state memory therefore matches Lion's and is exactly half of AdamW's. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at the 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at the 355M and 720M scales. On the theory side, we prove sharp complexity bounds under heavy-tailed noise; these bounds are governed by period-averaged smoothness and noise constants that interpolate between Muon's and Lion's. The bounds predict the compute-optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: https://github.com/brain-lab-research/lion-muon
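The alternation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: the dual EMA is taken to follow Lion's usual two-coefficient scheme (`beta1` blends the update direction, `beta2` updates the stored buffer), the Muon-style step is assumed to fall on every step whose index is divisible by the period, and the spectral matrix sign is computed exactly as U Vᵀ from an SVD rather than with the approximate iteration a practical Muon implementation would use. The names `msign` and `lionmuon_step` and all hyperparameter values are hypothetical.

```python
import numpy as np


def msign(m):
    """Spectral matrix sign of m: U @ V^T from its thin SVD."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def lionmuon_step(w, grad, buf, step, period=2, lr=0.01, beta1=0.9, beta2=0.99):
    """One alternating step; returns the new weights and the new shared buffer."""
    # Assumed Lion-style dual EMA: beta1 forms the direction used for this update.
    direction = beta1 * buf + (1.0 - beta1) * grad
    if step % period == 0:
        update = msign(direction)    # Muon-style spectral step (assumed schedule)
    else:
        update = np.sign(direction)  # Lion-style elementwise sign step
    new_w = w - lr * update
    # beta2 updates the single momentum buffer shared by both step types.
    new_buf = beta2 * buf + (1.0 - beta2) * grad
    return new_w, new_buf


# Toy problem: minimize 0.5 * ||W - target||_F^2, whose gradient is W - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))
w = np.zeros_like(target)
buf = np.zeros_like(target)  # the only optimizer state

loss0 = 0.5 * np.sum((w - target) ** 2)
for step in range(200):
    grad = w - target
    w, buf = lionmuon_step(w, grad, buf, step, period=2)
loss1 = 0.5 * np.sum((w - target) ** 2)

print(f"initial loss {loss0:.4f}, final loss {loss1:.4f}")
```

With `period=2` the two step types simply alternate. A larger period would presumably make the expensive spectral step rarer, which is the cost trade-off the period P controls. Weight decay and any per-layer update scaling are omitted here for brevity.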
Problem

Research questions and friction points this paper is trying to address.

large-scale optimization
efficient training
sign-based optimizers
spectral updates
iteration cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

alternating optimization
sign-based descent
spectral matrix-sign
dual-EMA momentum
compute-efficient training