Non-convergence to the optimal risk for Adam and stochastic gradient descent optimization in the training of deep neural networks

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
A long-standing theoretical gap concerns whether the population (true) risk of SGD-type optimizers converges to the optimal risk value in deep neural network training. Method: Leveraging a generalization-error analysis framework, this work establishes rigorous non-convergence results for eleven mainstream stochastic optimizers (standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) for arbitrary fully-connected feedforward networks, general activation and loss functions, and general random initializations, without relying on over-parameterization or idealized assumptions. Contribution/Results: The paper constructs a universal counterexample demonstrating that the population risk of all these optimizers fails to converge in probability to the optimal risk value; instead, with probability bounded away from zero, it remains separated from the optimum by a positive lower bound. This refutes the conventional belief that SGD-type methods asymptotically approach the optimal risk and provides a foundational negative result characterizing fundamental limits of optimization in deep learning.
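The failure of convergence in probability can be stated formally. Writing $\mathcal{R}$ for the true (population) risk, $\Theta_n$ for the optimizer iterates, and $\mathcal{R}^* = \inf_\theta \mathcal{R}(\theta)$ for the optimal risk value (notation assumed here for illustration, not taken from the paper), the negated convergence statement reads:

```latex
\exists\, \varepsilon > 0 : \quad
\limsup_{n \to \infty}
\mathbb{P}\bigl( \mathcal{R}(\Theta_n) - \mathcal{R}^* > \varepsilon \bigr) > 0 ,
```

i.e., with probability bounded away from zero, the risk stays at least $\varepsilon$ above the optimum along a subsequence of training steps.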

📝 Abstract
Despite the omnipresent use of stochastic gradient descent (SGD) optimization methods in the training of deep neural networks (DNNs), it remains, in basically all practically relevant scenarios, a fundamental open problem to provide a rigorous theoretical explanation for the success (and the limitations) of SGD optimization methods in deep learning. In particular, it remains an open question to prove or disprove convergence of the true risk of SGD optimization methods to the optimal true risk value in the training of DNNs. In one of the main results of this work we reveal for a general class of activations, loss functions, random initializations, and SGD optimization methods (including, for example, standard SGD, momentum SGD, Nesterov accelerated SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, Nadam, Nadamax, and AMSGrad) that in the training of any arbitrary fully-connected feedforward DNN it does not hold that the true risk of the considered optimizer converges in probability to the optimal true risk value. Nonetheless, the true risk of the considered SGD optimization method may very well converge to a strictly suboptimal true risk value.
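For context on the optimizer family the abstract enumerates, the following is a minimal sketch of a single Adam update step (the standard rule with first/second moment estimates and bias correction), applied to a toy one-dimensional quadratic risk. The step size, betas, and the toy objective are illustrative choices, not taken from the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v). t is the 1-based step index."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias-corrected moment estimates.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Parameter update with per-coordinate adaptive scaling.
    theta = [ti - lr * mh / (math.sqrt(vh) + eps)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy risk R(theta) = theta^2, gradient 2*theta (illustrative, not from the paper).
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 1001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t)
```

On this convex toy problem Adam drives the parameter toward the minimizer; the paper's point is that no such guarantee transfers to the true risk in DNN training, where the iterates can remain bounded away from the optimal risk value.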
Problem

Research questions and friction points this paper is trying to address.

Non-convergence of SGD-type methods to the optimal true risk in DNN training
Lack of a rigorous theoretical explanation for the success of SGD methods in deep learning
Possible convergence of the true risk to strictly suboptimal values in DNN training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes a general class of SGD optimization methods in DNN training
Proves non-convergence in probability to the optimal true risk value
Covers many SGD variants, including Adam, AMSGrad, and RMSprop
🔎 Similar Papers
No similar papers found.