Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
A long-standing theoretical gap concerns whether the population (true) risk of SGD-type optimizers converges to the optimal risk value in deep neural network training. Method: Leveraging a generalization-error analysis framework, this work establishes rigorous non-convergence results for eleven mainstream stochastic optimizers (standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) for arbitrary fully-connected feedforward networks, general activation and loss functions, and general random initializations, without relying on over-parameterization or idealized assumptions. Contribution/Results: The paper constructs a universal counterexample demonstrating that the population risk of all these optimizers fails to converge in probability to the optimal risk value; instead, with probability bounded away from zero, it remains separated from the optimum by a positive lower bound. This refutes the conventional belief that SGD-type methods asymptotically approach the optimal risk and provides a foundational negative result characterizing fundamental limits of optimization in deep learning.
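The failure of convergence in probability can be stated formally. Writing $\mathcal{R}$ for the true (population) risk, $\Theta_n$ for the optimizer iterates, and $\mathcal{R}^* = \inf_\theta \mathcal{R}(\theta)$ for the optimal risk value (notation assumed here for illustration, not taken from the paper), the negated convergence statement reads:

```latex
\exists\, \varepsilon > 0 : \quad
\limsup_{n \to \infty}
\mathbb{P}\bigl( \mathcal{R}(\Theta_n) - \mathcal{R}^* > \varepsilon \bigr) > 0 ,
```

i.e., with probability bounded away from zero, the risk stays at least $\varepsilon$ above the optimum along a subsequence of training steps.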

📝 Abstract
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
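For context on the optimizer family the abstract enumerates, the following is a minimal sketch of a single Adam update step (the standard rule with first/second moment estimates and bias correction), applied to a toy one-dimensional quadratic risk. The step size, betas, and the toy objective are illustrative choices, not taken from the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v). t is the 1-based step index."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias-corrected moment estimates.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Parameter update with per-coordinate adaptive scaling.
    theta = [ti - lr * mh / (math.sqrt(vh) + eps)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy risk R(theta) = theta^2, gradient 2*theta (illustrative, not from the paper).
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 1001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t)
```

On this convex toy problem Adam drives the parameter toward the minimizer; the paper's point is that no such guarantee transfers to the true risk in DNN training, where the iterates can remain bounded away from the optimal risk value.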
Problem

Research questions and friction points this paper is trying to address.

Non-convergence of SGD-type methods to the optimal true risk in DNN training
Lack of a rigorous theoretical explanation for the success of SGD methods in deep learning
Possible convergence of the true risk to strictly suboptimal values in DNN training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes a general class of SGD optimization methods in DNN training
Proves non-convergence in probability to the optimal true risk value
Covers many SGD variants, including Adam, AMSGrad, and RMSprop
🔎 Similar Papers
No similar papers found.