🤖 AI Summary
This work investigates how optimization algorithms influence feature learning during model fine-tuning, with a particular focus on saddle-point escape as a critical bottleneck. To this end, the authors propose a unified theoretical framework—steepest descent mirror flow—combined with diagonal linear networks and deep diagonal reparameterizations to systematically analyze how optimization geometry shapes learning dynamics, implicit bias, and sparsity. Both theoretical and empirical results demonstrate that steepest descent (including sign-based variants) effectively facilitates saddle-point escape and enhances feature learning, whereas SGD requires unrealistically large learning rates to achieve similar effects. Furthermore, AdamW significantly improves the stability of feature learning through a novel equilibrium (balance equations) induced by its decoupled weight decay, revealing an intrinsic mechanism underlying the advantage of Adam-style optimizers over SGD.
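To make the "decoupled weight decay" distinction concrete, here is a minimal NumPy sketch of a single AdamW update (not the paper's code; the hyperparameter values are illustrative defaults). The key point is that the decay term acts on the weights directly instead of being folded into the gradient that feeds the adaptive moments:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=1e-2):
    """One AdamW step. Weight decay is *decoupled*: it is applied to w
    directly, not added to g, so it never enters the moments m and v."""
    m = b1 * m + (1 - b1) * g          # first moment (momentum)
    v = b2 * v + (1 - b2) * g**2       # second moment (adaptive scale)
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w  # decoupled decay
    return w, m, v
```

With a zero gradient the moments stay at zero and the step reduces to a pure multiplicative shrinkage `w * (1 - lr * wd)`, which is exactly the term that induces the balance the summary refers to.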
📝 Abstract
How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods -- including sign descent, which is closely related to Adam -- we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steepest descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, a regime uncommon in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.
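The saddle-escape contrast can be illustrated with a toy experiment (a minimal NumPy sketch, not the paper's setup: the two-parameter factorized loss, learning rate, and initialization are our own illustrative assumptions). A single "feature" is parameterized as a product w = u·v, as in a diagonal linear network, so the origin is a saddle with vanishingly small gradients. Gradient descent with a small learning rate barely moves, while sign descent takes fixed-size steps regardless of gradient magnitude:

```python
import numpy as np

# Toy diagonal linear "network": one feature w = u * v,
# loss L(u, v) = 0.5 * (u*v - 1)^2. The point u = v = 0 is a saddle,
# and gradients near it are on the order of the (tiny) initialization.

def grads(u, v, target=1.0):
    r = u * v - target
    return r * v, r * u  # dL/du, dL/dv

def train(update, steps=200, lr=0.01, u=1e-4, v=1e-4):
    for _ in range(steps):
        gu, gv = grads(u, v)
        u, v = update(u, gu, lr), update(v, gv, lr)
    return 0.5 * (u * v - 1.0) ** 2  # final loss

gd_step   = lambda w, g, lr: w - lr * g           # gradient descent
sign_step = lambda w, g, lr: w - lr * np.sign(g)  # sign descent (Adam-like)

loss_gd, loss_sign = train(gd_step), train(sign_step)
# Gradient descent stays near the saddle (loss close to 0.5);
# sign descent escapes it and drives the loss near zero.
```

Under these settings, gradient descent grows the parameters only by a factor of roughly (1 + lr) per step from a 1e-4 initialization, so 200 steps leave it near the saddle, while sign descent crosses to the minimum in about 100 fixed-size steps.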