🤖 AI Summary
This work addresses three limitations of existing LMO-based optimization methods: slow convergence, the extra gradient evaluations required by variance reduction, and the lack of a unified theory covering both constrained and unconstrained settings. We propose LMO-IGT, a class of stochastic LMO-based methods that uses implicit gradient transport (IGT) to accelerate convergence while still requiring only a single stochastic gradient evaluation per iteration. We also introduce a unified framework for stochastic LMO-based optimization and a new stationarity measure, the regularized support function (RSF), which bridges gradient-norm and Frank-Wolfe-gap notions. Theoretically, LMO-IGT achieves an iteration complexity of 𝒪(ε⁻³·⁵), improving on standard stochastic LMO (𝒪(ε⁻⁴)); variance-reduced LMO attains a faster 𝒪(ε⁻³) rate, but only at the cost of additional gradient evaluations. Empirically, LMO-IGT consistently improves over its stochastic LMO counterparts with negligible overhead, and Muon-IGT performs best overall across the evaluated settings.
📝 Abstract
Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose \emph{LMO-IGT}, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the \emph{regularized support function} (RSF), which bridges gradient-norm and Frank--Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.
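To make the single-gradient-per-iteration structure concrete, the following is a minimal NumPy sketch of one LMO step driven by an IGT-style momentum. It is an illustration under stated assumptions, not the paper's algorithm: the extrapolation coefficient `beta / (1 - beta)` follows the generic IGT form from the literature, the function names (`lmo_linf`, `lmo_spectral`, `lmo_igt_step`) and hyperparameters are hypothetical, and the spectral LMO uses an exact SVD where practical Muon implementations typically use an approximate orthogonalization.

```python
import numpy as np

def lmo_linf(m, radius=1.0):
    """LMO over the l-infinity ball: argmin_{||s||_inf <= radius} <m, s>.
    This gives a sign-based (Lion-style) update direction."""
    return -radius * np.sign(m)

def lmo_spectral(M, radius=1.0):
    """LMO over the spectral-norm ball: argmin_{||S||_op <= radius} <M, S>.
    Exact SVD is used here for clarity (an assumption of this sketch)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -radius * (U @ Vt)

def lmo_igt_step(x, x_prev, m, grad_fn, lmo, lr=0.01, beta=0.9):
    """One hypothetical LMO-IGT-style step.

    The stochastic gradient is evaluated once, at a transported
    (extrapolated) point rather than at x itself. The coefficient
    beta / (1 - beta) is the generic IGT choice and is assumed here.
    """
    y = x + (beta / (1.0 - beta)) * (x - x_prev)  # transported point
    g = grad_fn(y)                                # single gradient call
    m_new = beta * m + (1.0 - beta) * g           # momentum average
    x_new = x + lr * lmo(m_new)                   # normalized LMO update
    return x_new, x, m_new

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.array([1.0, -2.0])
x_prev, m = x.copy(), np.zeros_like(x)
for _ in range(400):
    x, x_prev, m = lmo_igt_step(x, x_prev, m, lambda y: y, lmo_linf)
```

Swapping `lmo_linf` for `lmo_spectral` on a matrix-shaped parameter gives a Muon-style variant of the same step. In both cases the only change relative to a plain momentum LMO step is the point at which the gradient is evaluated, which is why the per-iteration cost stays at one gradient.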