🤖 AI Summary
Adaptive learning rate scheduling remains a fundamental challenge in deep learning optimization. This work addresses the online learning rate adaptation mechanism based on cumulative path length, identifying a theoretical inconsistency in its original formulation when applied to Adam: preconditioning distorts path length estimation, violating its geometric interpretation. To resolve this, we propose a modified path length definition aligned with Adam’s update dynamics and introduce a time-discounted normalization of gradient sequences to robustly estimate observed path length—ensuring comparability with the expected path length of a random walk. The resulting method is unified across both SGD and Adam, enhancing both theoretical soundness and empirical robustness. Experiments across diverse tasks—including image classification and language modeling—demonstrate that the corrected adaptive strategy consistently accelerates convergence, particularly under non-stationary loss landscapes.
📝 Abstract
The learning rate is a crucial hyperparameter in deep learning, with its ideal value depending on the problem and potentially changing during training. In this paper, we investigate the practical utility of adaptive learning rate mechanisms that adjust step sizes dynamically in response to the loss landscape. We revisit a cumulative path-based adaptation scheme proposed in 2017, which adjusts the learning rate based on the discrepancy between the observed path length, computed as a time-discounted sum of normalized gradient steps, and the expected length of a random walk. While the original approach offers a compelling intuition, we show that its adaptation mechanism for Adam is conceptually inconsistent due to the optimizer's internal preconditioning. We propose a corrected variant that better reflects Adam's update dynamics. To assess the practical value of online learning rate adaptation, we benchmark SGD and Adam, with and without cumulative adaptation, and compare them to a recent alternative method. Our results aim to clarify when and why such adaptive strategies offer practical benefits.