PADAM: Parallel averaged Adam reduces the error for stochastic optimization in scientific machine learning

📅 2025-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In scientific machine learning, conventional averaging strategies for stochastic optimization—such as Ruppert–Polyak averaging and exponential moving average (EMA)—require manual hyperparameter tuning, lack task adaptivity, and often inflate optimization error. To address this, we propose Parallel Averaging Adam (PADAM), the first method to introduce a parallel, adaptive averaging mechanism without additional gradient evaluations. PADAM concurrently executes multiple ADAM-based averaging variants—including Ruppert–Polyak, EMA, and others—and dynamically selects the optimal variant based on real-time optimization error. This enables error-driven, online adaptive averaging, eliminating reliance on fixed hyperparameters. We validate PADAM across major scientific deep learning frameworks—including physics-informed neural networks (PINNs), deep Galerkin methods, and deep BSDE solvers—on 13 diverse scientific tasks spanning PDE solving, optimal control, and optimal stopping. In the majority of cases, PADAM achieves the smallest or tied-best optimization error.

📝 Abstract
Averaging techniques such as Ruppert--Polyak averaging and the exponential moving average (EMA) are powerful approaches to accelerate stochastic gradient descent (SGD) type optimization methods such as the popular ADAM optimizer. However, depending on the specific optimization problem under consideration, the type and the parameters of the averaging need to be adjusted to achieve the smallest optimization error. In this work we propose an averaging approach, which we refer to as parallel averaged ADAM (PADAM), in which we compute several averaged variants of ADAM in parallel and, during the training process, dynamically select the variant with the smallest optimization error. A central feature of this approach is that it requires no more gradient evaluations than the usual ADAM optimizer, as each of the averaged trajectories relies on the same underlying ADAM trajectory and thus on the same underlying gradients. We test the proposed PADAM optimizer on 13 stochastic optimization and deep neural network (DNN) learning problems and compare its performance with known optimizers from the literature such as standard SGD, momentum SGD, ADAM with and without EMA, and ADAMW. In particular, we apply the compared optimizers to physics-informed neural network, deep Galerkin, deep backward stochastic differential equation, and deep Kolmogorov approximations for boundary value partial differential equation problems from scientific machine learning, as well as to DNN approximations for optimal control and optimal stopping problems. In nearly all of the considered examples PADAM achieves, sometimes among others and sometimes exclusively, essentially the smallest optimization error. This work thus strongly suggests considering PADAM for scientific machine learning problems and also motivates further research on adaptive averaging procedures within the training of DNNs.
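The mechanism described in the abstract can be sketched in a few lines: one ADAM trajectory drives several averaged copies (a Ruppert–Polyak running mean and EMAs with different decay rates), and the candidate with the smallest measured loss is selected. This is a minimal illustrative sketch on a toy quadratic objective, not the authors' implementation; all names, hyperparameters, and the objective are our own choices.

```python
import numpy as np

def loss(theta):
    return float(np.sum((theta - 3.0) ** 2))  # toy quadratic, minimum at theta = 3

def grad(theta, rng):
    # Noisy gradient oracle standing in for a stochastic mini-batch gradient.
    return 2.0 * (theta - 3.0) + 0.1 * rng.standard_normal(theta.shape)

rng = np.random.default_rng(0)
theta = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)              # ADAM first/second moment estimates
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

# Parallel averaged copies: Ruppert-Polyak mean and two EMAs. They reuse the
# same ADAM iterates, so no extra gradient evaluations are needed.
rp = np.zeros(2)
emas = {0.99: np.zeros(2), 0.999: np.zeros(2)}

for t in range(1, 2001):
    g = grad(theta, rng)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)
    vhat = v / (1 - b2 ** t)
    theta = theta - lr * mhat / (np.sqrt(vhat) + eps)   # standard ADAM step

    rp = rp + (theta - rp) / t                          # Ruppert-Polyak running mean
    for d in emas:
        emas[d] = d * emas[d] + (1 - d) * theta         # exponential moving averages

# Error-driven selection: return the candidate with the smallest measured loss.
candidates = {"adam": theta, "ruppert_polyak": rp,
              **{f"ema_{d}": w for d, w in emas.items()}}
best = min(candidates, key=lambda k: loss(candidates[k]))
print(best, loss(candidates[best]))
```

By construction the selected candidate is never worse than the raw ADAM iterate, since the raw iterate is itself one of the candidates; the only extra cost is the loss evaluation used for selection.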
Problem

Research questions and friction points this paper is trying to address.

Reduces error in stochastic optimization for scientific machine learning
Dynamically selects best averaged ADAM variant during training
Requires no extra gradient evaluations compared to standard ADAM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel averaged ADAM combines multiple averaging variants
Dynamic selection of optimal averaging during training
No extra gradient evaluations compared to standard ADAM
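The "no extra gradient evaluations" point above can be made concrete: each averaging variant is a pure function of the recorded ADAM iterates, so adding more variants never triggers another gradient call. A standalone sketch with hypothetical helper names and a stand-in trajectory:

```python
import numpy as np

# Stand-in for a recorded ADAM trajectory (1-D, ramping from 0 to 3).
iterates = [np.array([x]) for x in np.linspace(0.0, 3.0, 50)]

def ruppert_polyak(iterates):
    # Uniform average over all recorded iterates.
    return sum(iterates) / len(iterates)

def ema(iterates, decay):
    # Exponential moving average over the same iterates; no gradients touched.
    avg = iterates[0]
    for theta in iterates[1:]:
        avg = decay * avg + (1 - decay) * theta
    return avg

print(ruppert_polyak(iterates), ema(iterates, 0.9))
```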