🤖 AI Summary
Understanding how momentum mitigates stochastic fluctuations in SGD and improves generalization in deep neural network training remains theoretically underexplored.
Method: We reformulate optimization noise as the angular deviation between the stochastic gradient and the true steepest-descent direction—departing from conventional variance-based definitions—and develop a stochastic optimization dynamics framework that jointly models directional noise, learning rate, batch size, and local landscape smoothness.
Contribution/Results: We establish the first theoretical link between directional gradient deviation and generalization error; prove that momentum enhances directional smoothing of optimization noise, thereby regulating local smoothness of the loss landscape in concert with learning rate and batch size; derive rigorous generalization bounds for SGD with momentum; and empirically validate a strong negative correlation between directional noise magnitude and test accuracy across diverse architectures and datasets. This work provides a novel theoretical foundation and empirical validation for the generalization benefits of momentum.
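The directional-noise idea above can be made concrete with a small sketch. The function below (our own illustration, not code from the paper) measures noise as the angle between a minibatch gradient and the full-batch steepest-descent direction, rather than as gradient variance:

```python
import numpy as np

def directional_noise(stochastic_grad, full_grad):
    """Angular deviation (radians) between a stochastic gradient and the
    full-batch (steepest-descent) gradient.

    Illustrative reading of the paper's noise definition: noise is the gap
    between the optimizer's search direction and the steepest-descent
    direction, measured here as an angle via the cosine similarity.
    """
    g = np.asarray(stochastic_grad, dtype=float)
    f = np.asarray(full_grad, dtype=float)
    cos = np.dot(g, f) / (np.linalg.norm(g) * np.linalg.norm(f))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

# A minibatch gradient pointing slightly off the true descent direction:
theta = directional_noise([1.0, 0.1], [1.0, 0.0])  # ≈ 0.0997 rad
```

Under the paper's claim, larger values of this angle across training should correlate with worse test accuracy.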
📝 Abstract
For nonconvex objective functions, including those of deep neural networks, stochastic gradient descent (SGD) with momentum converges quickly and generalizes well, but a theoretical explanation for this behavior is lacking. In contrast to previous studies that defined the stochastic noise arising during optimization as the variance of the stochastic gradient, we define it as the gap between the optimizer's search direction and the steepest-descent direction, and we show that its level dominates the generalizability of the model. We also show that the stochastic noise in SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. By numerically deriving the stochastic noise level in SGD and SGD with momentum, we provide theoretical findings that help explain the training dynamics of SGD with momentum, which previous studies on convergence and stability could not. We also provide experimental results supporting our assertion that model generalizability depends on the stochastic noise level.
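The smoothing mechanism described in the abstract comes from momentum's exponentially weighted averaging of past stochastic gradients, which lets per-step directional fluctuations partially cancel. A minimal heavy-ball update sketch (the function name and hyperparameter values are our own; the paper additionally ties the noise level to batch size, gradient variance, and the gradient-norm bound):

```python
def sgd_momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum (heavy-ball) update.

    v accumulates an exponentially weighted sum of recent stochastic
    gradients, so the search direction -v is a smoothed version of the
    raw minibatch gradient: its angular deviation from the steepest
    descent direction fluctuates less than that of grad alone.

    lr is the learning rate and beta the momentum factor; both are
    illustrative defaults, not values prescribed by the paper.
    """
    v = beta * v + grad  # blend new gradient into the running direction
    w = w - lr * v       # descend along the smoothed direction
    return w, v

# One step from w=1.0 with zero initial velocity and gradient 2.0:
w, v = sgd_momentum_step(1.0, 0.0, 2.0)  # v = 2.0, w = 0.8
```

With beta = 0, this reduces to plain SGD; increasing beta lengthens the effective averaging window and, per the abstract, strengthens the smoothing of the objective in concert with the learning rate and batch size.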