Optimistic Dual Averaging Unifies Modern Optimizers

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two gaps: mainstream optimizers lack a shared theoretical framework, and weight decay requires laborious hyperparameter tuning. It introduces SODA, a generalization of Optimistic Dual Averaging under which optimizers such as Muon, Lion, AdEMAMix, and NAdam can all be viewed as optimistic instances. Building on this view, the authors propose a practical SODA wrapper for any base optimizer that replaces weight decay tuning with a theoretically grounded $1/k$ decay schedule. Experiments across model scales and training horizons show that SODA consistently improves performance without additional hyperparameter tuning.

📝 Abstract

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
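
To give some intuition for why a $1/k$ schedule pairs naturally with averaging-style methods, the sketch below shows a standard identity: shrinking the current value by $(1 - 1/k)$ and adding $1/k$ times a new value reproduces the running mean of all values seen so far. This is a generic illustration, not the SODA wrapper or the authors' derivation. The names `decay_coefficient` and `decayed_average` are hypothetical and do not come from the paper.

```python
from typing import Iterable, List


def decay_coefficient(k: int) -> float:
    """Return the 1/k coefficient for a 1-indexed step k (illustrative only)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return 1.0 / k


def decayed_average(updates: Iterable[float], x0: float = 0.0) -> List[float]:
    """Apply x <- (1 - 1/k) * x + (1/k) * z for each incoming value z.

    Assumption: scalar values for simplicity; this is not the paper's update rule.
    The result after k steps equals the uniform mean of the first k values,
    and the starting point x0 is fully forgotten after the first step.
    """
    x = x0
    trajectory = []
    for k, z in enumerate(updates, start=1):
        lam = decay_coefficient(k)
        x = (1.0 - lam) * x + lam * z  # shrink toward zero, then mix in z
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    # Each entry is the running mean of the inputs seen so far.
    print(decayed_average([2.0, 4.0, 6.0], x0=100.0))
```

Because the shrink factor is tied to the step count, this toy recursion has no separate decay strength to tune, which is the property the abstract highlights for the $1/k$ schedule. How SODA applies the schedule to a real base optimizer is specified in the paper itself.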

Problem

Research questions and friction points this paper is trying to address.

optimizers

weight decay

hyperparameter tuning

unified framework

optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimistic Dual Averaging

SODA

optimizer unification