🤖 AI Summary
This work addresses the challenge of applying the stochastic Polyak stepsize (SPS) directly to nonsmooth optimization, particularly in nonconvex settings, where standard SPS fails due to numerical instability and a lack of theoretical guarantees. We propose SPS$_{safe}$, a stochastic subgradient method that incorporates both a safeguard mechanism and momentum acceleration. Unlike prior SPS variants, SPS$_{safe}$ requires no interpolation assumption, no knowledge of the optimal objective value, and no strong convexity, extending SPS for the first time to general nonsmooth convex and nonconvex optimization. Theoretically, under standard bounded-subgradient assumptions, SPS$_{safe}$ enjoys stable convergence guarantees. Empirically, it suppresses the stepsize oscillations and numerical instability that arise from small gradients, markedly reducing the variance of the iterates. In deep neural network training, SPS$_{safe}$ mitigates vanishing gradients, converges faster, and is more robust than mainstream adaptive methods such as Adam and AdaGrad.
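The summary does not spell out the update rule, so the following is only a rough Python sketch of what a safeguarded Polyak stepsize with heavy-ball momentum can look like. The function name, the cap `gamma_max`, the `eps` denominator term, and the use of a known lower bound `f_star` are illustrative assumptions, not the paper's actual SPS$_{safe}$ rule.

```python
import numpy as np

def sps_safe_step(x, f_i, grad_i, f_star=0.0, gamma_max=1.0,
                  eps=1e-8, momentum=0.9, buf=None):
    """One hypothetical safeguarded-Polyak subgradient step.

    Sketch only: combines the classical Polyak stepsize
    (f_i(x) - f_star) / ||g||^2 with two common safeguards --
    an upper bound gamma_max and an eps term in the denominator
    that prevents blow-up when the subgradient is tiny.
    """
    g = grad_i(x)                       # stochastic subgradient at x
    gamma = min(gamma_max, (f_i(x) - f_star) / (np.dot(g, g) + eps))
    step = gamma * g
    if buf is not None:                 # optional heavy-ball momentum
        step = step + momentum * buf
    return x - step, step               # return new iterate and buffer

# Minimal usage on f(x) = ||x||^2 (f_star = 0), without momentum:
x = np.array([3.0, 4.0])
for _ in range(20):
    x, _ = sps_safe_step(x, lambda z: float(z @ z), lambda z: 2.0 * z)
```

Threading the returned `step` back in as `buf` on the next call gives the momentum variant; near a zero subgradient the `eps` term keeps the stepsize finite instead of dividing by (almost) zero.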
📝 Abstract
The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or on knowledge of the optimal objective value. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization without strong assumptions. We further incorporate momentum into the update rule and obtain equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate the theory: the proposed step size accelerates convergence, reduces variance, and consistently outperforms existing adaptive baselines. Finally, in deep neural network training, our method demonstrates robust performance by mitigating the vanishing gradient problem.