Lions and Muons: Optimization via Stochastic Frank-Wolfe

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep learning optimizers (e.g., Lion, Muon) lack a unified theoretical foundation, particularly in non-convex constrained settings and under heavy-tailed gradient noise. Method: We establish the first formal connection between these optimizers and the Stochastic Frank–Wolfe (SFW) algorithm, unifying them as weight-decay variants of SFW. We further propose a robust SFW variant—incorporating gradient clipping and momentum coupling—that provably mitigates heavy-tailed noise. Contribution/Results: We provide the first rigorous convergence guarantee to KKT points for such optimizers under non-convex constraints. Our robust SFW achieves the optimal convergence rate in terms of the Frank–Wolfe gap under heavy-tailed stochastic gradients. Empirical evaluation demonstrates substantial improvements in optimization stability and generalization across ViT and LLM training, validating both the theoretical insights and practical efficacy.
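The robust variant described above combines gradient clipping with momentum inside a Frank–Wolfe loop. The following is a minimal sketch of that idea, not the paper's exact algorithm: the function name `clipped_sfw_step`, the choice of an ℓ2-ball constraint, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def clipped_sfw_step(w, grad, m, radius=1.0, beta=0.9, gamma=0.01, clip=1.0):
    """One Stochastic Frank-Wolfe step with norm clipping (illustrative sketch)."""
    # Clip the raw stochastic gradient to tame heavy-tailed noise
    # (a standard robustification; the paper's precise coupling may differ).
    norm = np.linalg.norm(grad)
    g = grad if norm <= clip else (clip / norm) * grad
    # Momentum is accumulated on the clipped gradient.
    m = beta * m + (1 - beta) * g
    # Linear minimization oracle over the l2 ball of the given radius:
    # argmin_{||s|| <= radius} <m, s> = -radius * m / ||m||.
    s = -radius * m / max(np.linalg.norm(m), 1e-12)
    # Frank-Wolfe convex-combination update keeps the iterate feasible.
    w = (1 - gamma) * w + gamma * s
    return w, m
```

Because each iterate is a convex combination of a feasible point and an extreme point of the ball, feasibility is maintained without any projection step.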

📝 Abstract
Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained significant popularity in deep learning. In this work, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure for Frank-Wolfe methods in non-convex optimization. We further find that convergence in this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees, which in turn yield new variants of Lion and Muon.
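The Lion-as-SFW correspondence in the abstract can be checked concretely: Lion with weight decay λ matches SFW over the ℓ∞ ball of radius 1/λ, whose linear minimization oracle is a scaled sign, with FW step size γ = lr·λ. The sketch below assumes this reading and standard Lion hyperparameters; it is an illustration, not the paper's formal statement.

```python
import numpy as np

def lion_step(w, g, m, lr=1e-3, beta1=0.9, beta2=0.99, wd=0.1):
    # Lion: sign of the momentum-interpolated gradient, plus decoupled weight decay.
    update = np.sign(beta1 * m + (1 - beta1) * g)
    w_new = w - lr * (update + wd * w)
    m_new = beta2 * m + (1 - beta2) * g
    return w_new, m_new

def sfw_step(w, g, m, lr=1e-3, beta1=0.9, beta2=0.99, wd=0.1):
    # SFW over the l-inf ball of radius 1/wd.
    d = beta1 * m + (1 - beta1) * g       # momentum-interpolated gradient
    s = -(1.0 / wd) * np.sign(d)          # LMO: argmin_{||s||_inf <= r} <d, s>
    gamma = lr * wd                       # FW step size gamma = lr * wd
    w_new = (1 - gamma) * w + gamma * s   # convex-combination update
    m_new = beta2 * m + (1 - beta2) * g
    return w_new, m_new
```

Expanding the SFW update gives (1 − lr·wd)·w − lr·sign(d), which is exactly the Lion step rearranged, so the two functions produce identical iterates.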
Problem

Research questions and friction points this paper is trying to address.

Unifying Lion and Muon optimizers via Stochastic Frank-Wolfe
Establishing convergence guarantees for Lion and Muon
Extending Stochastic Frank-Wolfe for heavy-tailed noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies Lion and Muon via Stochastic Frank-Wolfe
Extends Stochastic Frank-Wolfe for heavy-tailed noise
Establishes convergence guarantees for KKT points