🤖 AI Summary
This study investigates the implicit bias of momentum-based optimizers such as Adam and Muon in smooth homogeneous neural networks, revealing that each optimizer prefers a specific norm-based margin maximization problem. By extending the normalized steepest descent analysis framework to momentum methods and allowing for a dynamic learning rate schedule, the authors theoretically establish that these optimizers converge to Karush–Kuhn–Tucker (KKT) points of margin maximization problems in the corresponding norm: ℓ2 for MomentumGD, ℓ∞ for Signum and Adam, the spectral norm for Muon, and mixed norms for the Muon-Signum and Muon-Adam hybrids. Experimental results corroborate the theoretical predictions, demonstrating that the choice of optimizer directly governs which margin the model maximizes. This work provides the first explicit correspondence between mainstream momentum optimizers and margin maximization criteria, offering a principled understanding of how optimization dynamics shape inductive bias in deep learning.
📝 Abstract
We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum-based steepest descent algorithms such as Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms, too, are biased towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
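The norm attached to each optimizer in the abstract determines the steepest descent direction its update follows. As a minimal illustrative sketch (not the paper's code; function names are ours), the per-step normalized steepest descent directions for the $\ell_\infty$, $\ell_2$, and spectral norms can be written as:

```python
import numpy as np

def linf_direction(g):
    # l_inf-norm steepest descent direction: the sign of the gradient
    # (this is the update direction underlying Signum and, per the paper,
    # Adam without the stability constant)
    return np.sign(g)

def l2_direction(g):
    # l_2-norm steepest descent direction: the unit-norm gradient
    # (the normalized update underlying MomentumGD)
    return g / np.linalg.norm(g)

def spectral_direction(G):
    # Spectral-norm steepest descent direction for a matrix gradient:
    # orthogonalize G via its SVD, G = U S V^T  ->  U V^T
    # (the semi-orthogonal update underlying Muon)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

g = np.array([3.0, -4.0])
print(linf_direction(g))       # [ 1. -1.]
print(l2_direction(g))         # [ 0.6 -0.8]

G = np.array([[2.0, 0.0], [0.0, 0.5]])
print(spectral_direction(G))   # 2x2 identity: singular values are discarded
```

In each case the direction has unit norm in the optimizer's associated norm, which is what ties the optimizer to margin maximization in that same norm.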