Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Momentum-based optimizers suffer from gradient computation lag: the gradient is evaluated at the current parameters even though the accumulated momentum will carry the model further before that gradient takes effect. Method: This paper proposes Overshoot, an optimizer that evaluates gradients at a predicted parameter location shifted forward along the momentum direction, extending the lookahead beyond Nesterov's single-step shift. Built on top of SGD with momentum and Adam, Overshoot changes only where the gradient is estimated, incurring no additional memory overhead and only a small computational overhead per step. Contribution/Results: Experiments across diverse tasks demonstrate that Overshoot consistently outperforms standard momentum and Nesterov's momentum, reducing the required training steps by at least 15% on average. It remains fully compatible with mainstream optimizers and infrastructure, enabling seamless integration into existing deep learning pipelines.

📝 Abstract
Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at the current model weights before taking a step and updating the model. Rather than calculating the gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient w.r.t. the exact model weights now, in favor of evaluating at a point that will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory and small computational overhead.
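The abstract's idea, evaluating the gradient at weights shifted along the momentum direction, can be sketched on plain SGD with momentum. This is a minimal illustrative sketch, not the paper's implementation: the function name, the overshoot factor `gamma`, and the exact form of the shift are assumptions. Setting `gamma = 0` recovers standard momentum, and `gamma = 1` is close in spirit to Nesterov's lookahead.

```python
def overshoot_sgd_step(params, grad_fn, momentum, lr=0.05, beta=0.9, gamma=2.0):
    """One SGD-with-momentum step using an overshot gradient (illustrative sketch).

    params   : list of floats, current model weights (updated in place)
    grad_fn  : callable mapping a weight list to its gradient list
    momentum : list of floats, running momentum buffer (updated in place)
    gamma    : assumed overshoot factor -- how far along the momentum
               direction to shift the weights before evaluating the gradient
    """
    # Evaluate the gradient at weights shifted along the momentum direction,
    # i.e. at a point the optimizer is likely headed toward.
    shifted = [p - lr * gamma * m for p, m in zip(params, momentum)]
    grads = grad_fn(shifted)

    # Standard momentum update, but driven by the overshot gradient.
    for i, g in enumerate(grads):
        momentum[i] = beta * momentum[i] + g
        params[i] -= lr * momentum[i]
    return params


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, m = [5.0], [0.0]
for _ in range(100):
    overshoot_sgd_step(x, lambda ws: [2.0 * w for w in ws], m)
```

On this quadratic the iterate shrinks toward the minimum at 0; the only change relative to textbook momentum SGD is the single line computing `shifted`.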
Problem

Research questions and friction points this paper is trying to address.

Momentum-based Optimization
Machine Learning Efficiency
Convergence Acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Overshoot
Momentum-based Optimization
Gradient Prediction