Online Learning-guided Learning Rate Adaptation via Gradient Alignment

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual learning rate tuning in deep learning optimization remains labor-intensive and non-adaptive. Method: This paper proposes GALA, a framework that formulates learning rate selection as a one-dimensional online learning problem. GALA dynamically adjusts the learning rate by jointly estimating gradient alignment and local curvature in real time. It integrates gradient alignment metrics with the Follow-the-Regularized-Leader (FTRL) online algorithm for the first time. Contribution/Results: GALA provides provable convergence guarantees for non-convex, smooth objectives under data-adaptive conditions—without requiring pre-specified hyperparameters. It is plug-and-play compatible with standard optimizers such as SGD and Adam, achieves robust high performance across a wide range of initial learning rates, and matches or exceeds finely tuned baselines without any manual hyperparameter search.

📝 Abstract
The performance of an optimizer on large-scale deep learning models depends critically on fine-tuning the learning rate, often requiring an extensive grid search over base learning rates, schedules, and other hyperparameters. In this paper, we propose a principled framework called GALA (Gradient Alignment-based Learning rate Adaptation), which dynamically adjusts the learning rate by tracking the alignment between consecutive gradients and using a local curvature estimate. Guided by the convergence analysis, we formulate the problem of selecting the learning rate as a one-dimensional online learning problem. When paired with an online learning algorithm such as Follow-the-Regularized-Leader, our method produces a flexible, adaptive learning rate schedule that tends to increase when consecutive gradients are aligned and decrease otherwise. We establish a data-adaptive convergence rate for normalized SGD equipped with GALA in the smooth, nonconvex setting. Empirically, common optimizers such as SGD and Adam, when augmented with GALA, demonstrate robust performance across a wide range of initial learning rates and perform competitively without the need for tuning.
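The abstract describes the core mechanism: cast learning rate selection as a one-dimensional online learning problem, and let an FTRL-style update raise the rate when consecutive gradients align and lower it otherwise. A minimal sketch of that idea is below; the cosine-alignment signal, log-space parameterization, and quadratic FTRL regularizer are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def cosine_alignment(g_prev, g_curr, eps=1e-12):
    """Cosine similarity between consecutive gradients."""
    return float(g_prev @ g_curr) / (
        np.linalg.norm(g_prev) * np.linalg.norm(g_curr) + eps
    )

class AlignmentFTRL_LR:
    """Sketch of an FTRL-style 1D controller for the log learning rate.

    Aligned consecutive gradients (positive cosine) push the rate up;
    anti-aligned gradients push it down. The regularizer strength and
    the log-space update are assumptions for illustration.
    """

    def __init__(self, lr0=0.1, reg=10.0):
        self.log_lr0 = np.log(lr0)
        self.reg = reg          # FTRL regularization strength (assumed)
        self.align_sum = 0.0    # running sum of alignment signals
        self.t = 0
        self.g_prev = None

    def step(self, grad):
        """Observe a gradient, return the learning rate to use."""
        if self.g_prev is not None:
            self.t += 1
            self.align_sum += cosine_alignment(self.g_prev, grad)
        self.g_prev = np.copy(grad)
        # FTRL with a quadratic regularizer centered at log_lr0:
        #   argmin_x  -align_sum * x + (reg * sqrt(t+1) / 2) * (x - log_lr0)^2
        log_lr = self.log_lr0 + self.align_sum / (self.reg * np.sqrt(self.t + 1))
        return float(np.exp(log_lr))
```

In use, such a controller would be queried once per optimizer step (e.g. around SGD or Adam, as the abstract suggests): repeated gradients in the same direction grow the rate, while oscillating gradients shrink it back toward the initial value.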
Problem

Research questions and friction points this paper is trying to address.

Manual learning rate tuning requires extensive grid search over base rates and schedules
Optimizer performance is highly sensitive to the choice of initial learning rate
Pre-specified schedules do not adapt to training dynamics in real time
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning rate selection cast as a one-dimensional online learning (FTRL) problem
Joint use of gradient alignment and local curvature estimates to adapt the rate
Plug-and-play with SGD and Adam, with data-adaptive convergence guarantees