🤖 AI Summary
This work provides a theoretical interpretation of the "edge of stability" (EoS) phenomenon, in which the largest eigenvalue of the Hessian rises toward \(2/\eta\) during training, apparently violating classical smoothness assumptions. Building on the notion of directional smoothness, the authors extend EoS to non-Euclidean optimization methods under arbitrary norms and propose a geometry-aware generalized sharpness measure. This framework covers, within a single analysis, diverse algorithms including \(\ell_{\infty}\)-descent, block coordinate descent, spectral gradient descent, and momentum-free Muon. Empirical results confirm that these non-Euclidean gradient descent variants consistently exhibit an initial phase of progressive sharpening followed by oscillations near \(2/\eta\), supporting the universality of the proposed generalized sharpness measure.
📝 Abstract
The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue of the Hessian) converges to $2/\eta$ during training with gradient descent (GD) with step-size $\eta$. Despite apparently violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.
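To make the Euclidean baseline concrete, here is a minimal sketch (not from the paper) of how sharpness is typically measured in EoS experiments: power iteration on Hessian-vector products, compared against the stability threshold $2/\eta$. The quadratic objective, step-size, and function names below are illustrative assumptions.

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-4):
    # Finite-difference Hessian-vector product:
    # H v ≈ (∇f(x + εv) − ∇f(x − εv)) / (2ε)
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def sharpness(grad_fn, x, iters=100, seed=0):
    # Power iteration: estimate the largest Hessian eigenvalue at x.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        Hv = hvp(grad_fn, x, v)
        lam = float(v @ Hv)            # Rayleigh quotient v^T H v
        v = Hv / (np.linalg.norm(Hv) + 1e-12)
    return lam

# Illustrative quadratic f(x) = 0.5 x^T A x with eigenvalues 1 and 4,
# so the sharpness should come out near 4.
A = np.diag([1.0, 4.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
lam_max = sharpness(grad, x)
eta = 0.45  # 2/eta ≈ 4.44 > lam_max: GD is stable; eta > 0.5 would cross 2/eta
print(f"sharpness ≈ {lam_max:.3f}, threshold 2/eta = {2 / eta:.3f}")
```

For a quadratic the Hessian is constant, so GD diverges along the top eigenvector exactly when the sharpness exceeds $2/\eta$; the paper's generalized sharpness plays the analogous role for non-Euclidean updates.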