🤖 AI Summary
This work addresses a theoretical gap: sign-based methods such as SignSGD often outperform SGD in practice, even though SGD is provably minimax optimal under standard smoothness and finite-variance assumptions when stationarity is measured in the $\ell_2$-norm. The authors analyze the problem under $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, and derive matched upper and lower bounds for SignSGD in this non-Euclidean geometry. Comparing the upper bound of SignSGD with the lower bound of SGD shows that, under sparse noise, SignSGD reduces the complexity by a factor of the problem dimension $d$, a prediction consistent with its faster convergence in the pretraining of a 124M-parameter GPT-2 model. The framework is further extended to the matrix setting, where a corresponding optimal lower bound is given for the Muon optimizer.
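For reference, the quantities named above can be written out as follows. This is a sketch using the standard dual-norm conventions and is an assumption about notation; the paper's exact definitions may differ, and the separable noise model is not reproduced here. The symbols $f$ (objective), $\epsilon$ (target accuracy), and $L_\infty$ (smoothness constant) are introduced only for illustration.

$$
\|\nabla f(x)\|_1 \le \epsilon \quad (\ell_1\text{-stationarity}), \qquad \|\nabla f(x) - \nabla f(y)\|_1 \le L_\infty \, \|x - y\|_\infty \quad (\ell_\infty\text{-smoothness}).
$$

Since $\|v\|_2 \le \|v\|_1 \le \sqrt{d}\,\|v\|_2$ for every $v \in \mathbb{R}^d$, the $\ell_1$ and $\ell_2$ stationarity measures can differ by a dimension-dependent factor, which is why the choice of geometry matters when comparing rates.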
📝 Abstract
Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite-variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers under $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we extend this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer and proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M-parameter GPT-2 model.
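To make the update rules discussed above concrete, here is a minimal NumPy sketch of one SGD step, one SignSGD step, and a matrix analogue of the sign operator. It is an illustration under stated assumptions, not the authors' implementation: the function names and the learning rate `lr` are hypothetical, and the matrix sign is computed with an exact SVD for clarity, whereas Muon in practice applies an approximate orthogonalization (a Newton-Schulz iteration) to a momentum matrix.

```python
import numpy as np


def sgd_step(x, grad, lr):
    """Vanilla SGD: move along the negative stochastic gradient."""
    return x - lr * grad


def signsgd_step(x, grad, lr):
    """SignSGD: use only the sign of each gradient coordinate.

    Every coordinate with a nonzero gradient moves by exactly lr,
    regardless of the gradient's magnitude in that coordinate.
    """
    return x - lr * np.sign(grad)


def matrix_sign(G):
    """Matrix analogue of sign (illustrative, exact-SVD version).

    For the reduced SVD G = U @ diag(s) @ Vt, return U @ Vt, i.e. G with
    all of its singular values replaced by 1.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def l1_stationarity(grad):
    """l1-norm of the gradient, the stationarity measure named in the text."""
    return np.abs(grad).sum()


if __name__ == "__main__":
    x = np.array([1.0, -2.0, 0.0])
    g = np.array([0.5, -3.0, 0.0])
    print("SGD step:    ", sgd_step(x, g, lr=0.1))
    print("SignSGD step:", signsgd_step(x, g, lr=0.1))
    print("l1 norm of g:", l1_stationarity(g))
    print("matrix sign: ", matrix_sign(np.diag([2.0, -3.0])))
```

For a diagonal matrix with nonzero entries, `matrix_sign` coincides with the elementwise sign, which is the sense in which it generalizes the vector operator.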