🤖 AI Summary
Understanding how momentum mitigates stochastic fluctuations in SGD and improves generalization in deep neural network training remains theoretically underexplored.
Method: We reformulate optimization noise as the angular deviation between the stochastic gradient and the true steepest-descent direction—departing from conventional variance-based definitions—and develop a stochastic optimization dynamics framework that jointly models directional noise, learning rate, batch size, and local landscape smoothness.
Contribution/Results: We establish the first theoretical link between directional gradient deviation and generalization error; prove that momentum enhances directional smoothing of optimization noise, thereby regulating local smoothness of the loss landscape in concert with learning rate and batch size; derive rigorous generalization bounds for SGD with momentum; and empirically validate a strong negative correlation between directional noise magnitude and test accuracy across diverse architectures and datasets. This work provides a novel theoretical foundation and empirical validation for the generalization benefits of momentum.
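The directional-noise idea above can be made concrete with a small sketch. The function below (our own illustration, not code from the paper) measures noise as the angle between a minibatch gradient and the full-batch steepest-descent direction, rather than as gradient variance:

```python
import numpy as np

def directional_noise(stochastic_grad, full_grad):
    """Angular deviation (radians) between a stochastic gradient and the
    full-batch (steepest-descent) gradient.

    Illustrative reading of the paper's noise definition: noise is the gap
    between the optimizer's search direction and the steepest-descent
    direction, measured here as an angle via the cosine similarity.
    """
    g = np.asarray(stochastic_grad, dtype=float)
    f = np.asarray(full_grad, dtype=float)
    cos = np.dot(g, f) / (np.linalg.norm(g) * np.linalg.norm(f))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

# A minibatch gradient pointing slightly off the true descent direction:
theta = directional_noise([1.0, 0.1], [1.0, 0.0])  # ≈ 0.0997 rad
```

Under the paper's claim, larger values of this angle across training should correlate with worse test accuracy.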
📝 Abstract
For nonconvex objective functions, including those of deep neural networks, stochastic gradient descent (SGD) with momentum converges quickly and generalizes well, but a theoretical explanation for this behavior is lacking. In contrast to previous studies that defined the stochastic noise arising during optimization as the variance of the stochastic gradient, we define it as the gap between the optimizer's search direction and the steepest-descent direction, and we show that its level dominates the generalizability of the model. We also show that the stochastic noise in SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. By numerically deriving the stochastic noise level in SGD and SGD with momentum, we provide theoretical findings that help explain the training dynamics of SGD with momentum, which previous studies on convergence and stability could not. We also provide experimental results supporting our assertion that model generalizability depends on the stochastic noise level.
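The smoothing mechanism described in the abstract comes from momentum's exponentially weighted averaging of past stochastic gradients, which lets per-step directional fluctuations partially cancel. A minimal heavy-ball update sketch (the function name and hyperparameter values are our own; the paper additionally ties the noise level to batch size, gradient variance, and the gradient-norm bound):

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum (heavy-ball) update.

    v accumulates an exponentially weighted sum of recent stochastic
    gradients, so the search direction -v is a smoothed version of the
    raw minibatch gradient: its angular deviation from the steepest
    descent direction fluctuates less than that of grad alone.

    lr is the learning rate and beta the momentum factor; both are
    illustrative defaults, not values prescribed by the paper.
    """
    v = beta * v + grad  # blend new gradient into the running direction
    w = w - lr * v       # descend along the smoothed direction
    return w, v

# One step from w=1.0 with zero initial velocity and gradient 2.0:
w, v = sgd_momentum_step(1.0, 0.0, 2.0)  # v = 2.0, w = 0.8
```

With beta = 0, this reduces to plain SGD; increasing beta lengthens the effective averaging window and, per the abstract, strengthens the smoothing of the objective in concert with the learning rate and batch size.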