Optimistic Dual Averaging Unifies Modern Optimizers

📅 2026-05-11

🤖 AI Summary
This work addresses two problems: mainstream optimizers lack a unified theoretical framework, and weight decay requires cumbersome hyperparameter tuning. It proposes SODA, a generalization of Optimistic Dual Averaging under which optimizers such as Muon, Lion, AdEMAMix, and NAdam can all be viewed as optimistic instances. Building on this view, the authors construct a wrapper that applies to any base optimizer and replaces weight decay tuning with a theoretically grounded $1/k$ decay schedule. Experiments across various model scales and training horizons show that SODA consistently improves performance without additional hyperparameter tuning.
📝 Abstract
We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
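To make the $1/k$ schedule concrete, here is a minimal sketch of how a wrapper might attach a decaying weight decay term to an arbitrary base update. It is not the paper's algorithm: the function names, the `base_update` interface, the decoupled (AdamW-style) form of the decay, and the sign-based stand-in optimizer are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def one_over_k_decay(step: int) -> float:
    """Weight decay coefficient 1/k for a 1-indexed step k (assumed form)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return 1.0 / step


def wrapped_step(
    weights: Vector,
    grads: Vector,
    step: int,
    lr: float,
    base_update: Callable[[Vector], Vector],
) -> Vector:
    """Apply a base optimizer's update plus decoupled 1/k weight decay.

    `base_update` is a hypothetical interface: it maps gradients to an
    update direction. The decay is applied directly to the weights, so
    no separate weight decay coefficient has to be chosen by hand.
    """
    update = base_update(grads)
    decay = one_over_k_decay(step)
    return [w - lr * u - lr * decay * w for w, u in zip(weights, update)]


def sign_update(grads: Vector) -> Vector:
    """Illustrative stand-in base optimizer: elementwise sign of the gradient."""
    return [float((g > 0) - (g < 0)) for g in grads]


if __name__ == "__main__":
    w = [1.0, -2.0]
    g = [0.5, -0.5]
    # At step k = 2 the decay coefficient is 1/2.
    print(wrapped_step(w, g, step=2, lr=0.1, base_update=sign_update))
```

Under these assumptions the decay term is strongest early in training and fades as $k$ grows, which is why no fixed coefficient needs tuning in this sketch. How SODA actually combines the schedule with the base optimizer's state is specified in the paper, not here.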
Problem

Research questions and friction points this paper is trying to address.

optimizers
weight decay
hyperparameter tuning
unified framework
optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic Dual Averaging
SODA
optimizer unification
weight decay scheduling
hyperparameter-free optimization