LionMuon: Alternating Spectral and Sign Descent for Efficient Training

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key challenge in large-scale optimization: obtaining effective update directions while substantially reducing per-iteration computational cost. We propose LionMuon, an optimizer that alternates on a fixed period P between Lion’s sign-based updates and Muon’s spectral matrix-sign updates, with both update types sharing a single dual exponential moving average (EMA) momentum buffer. The design retains the effectiveness of Muon steps at an averaged iteration cost close to that of sign-based methods, and its optimizer state matches Lion’s, which is half of AdamW’s. At P = 2 and 124M parameters, LionMuon reaches lower validation loss at lower compute than Muon, Lion, Signum, and AdamW on every dataset and architecture tested, and the same advantage persists at 355M and 720M parameters.
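
The memory claim follows from counting optimizer buffers: AdamW stores two values per parameter (first- and second-moment estimates), while a single-momentum-buffer optimizer such as Lion stores one. The short Python sketch below works through that arithmetic. The helper name `optimizer_state_bytes`, the use of 32-bit state, and the rounded parameter counts are assumptions made for illustration, not figures from the paper.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_value=4):
    """Bytes of optimizer state, assuming every buffer is stored in fp32."""
    return num_params * buffers_per_param * bytes_per_value


ADAMW_BUFFERS = 2   # first- and second-moment estimates
SINGLE_BUFFER = 1   # one shared momentum buffer, as in Lion

# Nominal model sizes mentioned in the summary (rounded, for illustration).
for n in (124_000_000, 355_000_000, 720_000_000):
    adamw = optimizer_state_bytes(n, ADAMW_BUFFERS)
    single = optimizer_state_bytes(n, SINGLE_BUFFER)
    print(f"{n / 1e6:.0f}M params: AdamW {adamw / 1e6:.0f} MB, "
          f"single buffer {single / 1e6:.0f} MB")
```

The ratio is exactly one half regardless of model size or state precision, as long as both optimizers store their buffers in the same precision.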

📝 Abstract

In large-scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign-based optimizers like Lion or Signum produce cheap per-step updates, whereas Muon's spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign-based methods. It alternates between Lion's and Muon's updates on a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW's. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at 355M and 720M scale. On the theory side, we prove sharp complexity bounds under heavy-tailed noise which are governed by period-averaged smoothness and noise that interpolate between Muon's and Lion's constants. These bounds predict the compute-optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: https://github.com/brain-lab-research/lion-muon
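
The alternation described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function names `matrix_sign` and `lionmuon_step`, the hyperparameter values, the choice of which step in each period is the spectral one, and the Lion-style form of the dual-EMA update are all assumptions. The spectral matrix sign is computed here with an exact SVD for clarity, whereas Muon implementations typically approximate it with a Newton-Schulz iteration.

```python
import numpy as np


def matrix_sign(m):
    """Spectral matrix sign: U @ Vt from the reduced SVD m = U diag(s) Vt."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def lionmuon_step(w, g, buf, step, period=2, lr=1e-3, beta1=0.9, beta2=0.99):
    """One hypothetical LionMuon-style step on a 2-D weight matrix.

    Assumed update rule (not taken from the paper):
      c   = beta1 * buf + (1 - beta1) * g        # Lion-style interpolation
      dir = matrix_sign(c) on every `period`-th step, else sign(c)
      buf = beta2 * buf + (1 - beta2) * g        # the single shared buffer
    Returns (new_w, new_buf).
    """
    c = beta1 * buf + (1.0 - beta1) * g
    if step % period == 0:
        direction = matrix_sign(c)   # expensive spectral (Muon-like) step
    else:
        direction = np.sign(c)       # cheap elementwise (Lion-like) step
    new_w = w - lr * direction
    new_buf = beta2 * buf + (1.0 - beta2) * g
    return new_w, new_buf


# Toy usage: minimize 0.5 * ||w - target||_F^2, whose gradient is w - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3))
w = np.zeros((4, 3))
buf = np.zeros_like(w)           # the only optimizer state
for step in range(200):
    grad = w - target
    w, buf = lionmuon_step(w, grad, buf, step, period=2, lr=1e-2)
print(float(np.linalg.norm(w - target)))
```

Both branches read from the same interpolated momentum, so the only state carried between steps is `buf`, one array with the shape of the weights. With `period=2`, every other step pays for the matrix decomposition and the rest cost only an elementwise sign.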

Problem

Research questions and friction points this paper is trying to address.

large-scale optimization

efficient training

sign-based optimizers

spectral updates

iteration cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

alternating optimization

sign-based descent

spectral matrix-sign