Overshoot: Taking advantage of future gradients in momentum-based stochastic optimization

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Momentum-based optimizers suffer from gradient computation lag: the gradient is evaluated at the current parameters even though the accumulated momentum will carry the model further before that gradient takes effect. Method: This paper proposes Overshoot, an optimizer that evaluates gradients at a predicted parameter location shifted forward along the momentum direction, extending the lookahead beyond Nesterov's single-step shift. Built on top of SGD with momentum and Adam, Overshoot changes only where the gradient is estimated, incurring no additional memory overhead and only a small computational overhead per step. Contribution/Results: Experiments across diverse tasks demonstrate that Overshoot consistently outperforms standard momentum and Nesterov's momentum, reducing the required training steps by at least 15% on average. It remains fully compatible with mainstream optimizers and infrastructure, enabling seamless integration into existing deep learning pipelines.

📝 Abstract
Overshoot is a novel, momentum-based stochastic gradient descent optimization method designed to enhance performance beyond standard and Nesterov's momentum. In conventional momentum methods, gradients from previous steps are aggregated with the gradient at the current model weights before taking a step and updating the model. Rather than calculating the gradient at the current model weights, Overshoot calculates the gradient at model weights shifted in the direction of the current momentum. This sacrifices the immediate benefit of using the gradient w.r.t. the exact model weights now, in favor of evaluating at a point that will likely be more relevant for future updates. We show that incorporating this principle into momentum-based optimizers (SGD with momentum and Adam) results in faster convergence (saving on average at least 15% of steps). Overshoot consistently outperforms both standard and Nesterov's momentum across a wide range of tasks and integrates into popular momentum-based optimizers with zero memory and small computational overhead.
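The abstract's idea, evaluating the gradient at weights shifted along the momentum direction, can be sketched on plain SGD with momentum. This is a minimal illustrative sketch, not the paper's implementation: the function name, the overshoot factor `gamma`, and the exact form of the shift are assumptions. Setting `gamma = 0` recovers standard momentum, and `gamma = 1` is close in spirit to Nesterov's lookahead.

```python
def overshoot_sgd_step(params, grad_fn, momentum, lr=0.05, beta=0.9, gamma=2.0):
    """One SGD-with-momentum step using an overshot gradient (illustrative sketch).

    params   : list of floats, current model weights (updated in place)
    grad_fn  : callable mapping a weight list to its gradient list
    momentum : list of floats, running momentum buffer (updated in place)
    gamma    : assumed overshoot factor -- how far along the momentum
               direction to shift the weights before evaluating the gradient
    """
    # Evaluate the gradient at weights shifted along the momentum direction,
    # i.e. at a point the optimizer is likely headed toward.
    shifted = [p - lr * gamma * m for p, m in zip(params, momentum)]
    grads = grad_fn(shifted)

    # Standard momentum update, but driven by the overshot gradient.
    for i, g in enumerate(grads):
        momentum[i] = beta * momentum[i] + g
        params[i] -= lr * momentum[i]
    return params


# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, m = [5.0], [0.0]
for _ in range(100):
    overshoot_sgd_step(x, lambda ws: [2.0 * w for w in ws], m)
```

On this quadratic the iterate shrinks toward the minimum at 0; the only change relative to textbook momentum SGD is the single line computing `shifted`.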
Problem

Research questions and friction points this paper is trying to address.

Momentum-based Optimization
Machine Learning Efficiency
Convergence Acceleration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Overshoot
Momentum-based Optimization
Gradient Prediction