AI Summary
In scientific machine learning, conventional averaging strategies for stochastic optimization, such as Ruppert–Polyak averaging and the exponential moving average (EMA), require manual hyperparameter tuning, lack task adaptivity, and often inflate the optimization error. To address this, we propose parallel averaged ADAM (PADAM), the first method to introduce a parallel, adaptive averaging mechanism that requires no additional gradient evaluations. PADAM concurrently runs multiple averaged variants of ADAM, including Ruppert–Polyak averaging, EMA, and others, and dynamically selects the variant with the smallest measured optimization error. This enables error-driven, online adaptive averaging and eliminates the reliance on fixed averaging hyperparameters. We validate PADAM across major scientific deep learning frameworks, including physics-informed neural networks (PINNs), deep Galerkin methods, and deep BSDE solvers, on 13 diverse scientific tasks spanning PDE solving, optimal control, and optimal stopping. In the majority of cases, PADAM achieves the smallest or tied-best optimization error.
Abstract
Averaging techniques such as Ruppert–Polyak averaging and exponential moving averaging (EMA) are powerful approaches to accelerate stochastic gradient descent (SGD) optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type of averaging and its parameters need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute several averaged variants of ADAM in parallel and, during the training process, dynamically select the variant with the smallest optimization error. A central feature of this approach is that it requires no more gradient evaluations than the usual ADAM optimizer, as each of the averaged trajectories relies on the same underlying ADAM trajectory and thus on the same underlying gradients. We test the proposed PADAM optimizer on 13 stochastic optimization and deep neural network (DNN) learning problems and compare its performance with known optimizers from the literature such as standard SGD, momentum SGD, ADAM with and without EMA, and ADAMW. In particular, we apply the compared optimizers to physics-informed neural network, deep Galerkin, deep backward stochastic differential equation, and deep Kolmogorov approximations for boundary value partial differential equation problems from scientific machine learning, as well as to DNN approximations for optimal control and optimal stopping problems. In nearly all of the considered examples PADAM achieves, sometimes among others and sometimes exclusively, essentially the smallest optimization error. This work thus strongly suggests considering PADAM for scientific machine learning problems and also motivates further research on adaptive averaging procedures within the training of DNNs.
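The core mechanism described above, a single ADAM trajectory driving several averaged copies in parallel, with the copy of smallest measured optimization error selected, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the particular EMA decay rates, and the final-loss selection criterion are assumptions made here for concreteness.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard ADAM update with bias correction."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def padam(loss, grad, theta0, steps=500, ema_decays=(0.9, 0.99, 0.999)):
    """Sketch of the PADAM idea: one ADAM trajectory, several averages.

    Each gradient evaluation is shared by all averaged copies, so the
    cost matches plain ADAM. The candidate with the smallest loss is
    returned (a hypothetical selection rule for this sketch).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    rp = theta.copy()                          # Ruppert-Polyak running mean
    emas = [theta.copy() for _ in ema_decays]  # one EMA per decay rate
    for t in range(1, steps + 1):
        g = grad(theta)                        # single gradient, reused by all averages
        theta, m, v = adam_step(theta, g, m, v, t)
        rp += (theta - rp) / t                 # uniform average of the iterates
        for i, d in enumerate(ema_decays):
            emas[i] = d * emas[i] + (1 - d) * theta
    candidates = [theta, rp] + emas
    return min(candidates, key=loss)           # pick smallest optimization error
```

On a noisy quadratic, for example, `padam(loss, grad, np.ones(2))` returns whichever of the raw iterate, the Ruppert-Polyak average, or the EMAs attains the lowest loss; since the raw ADAM iterate is itself among the candidates, the selected parameters are never worse than plain ADAM under the chosen error measure.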