Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although the Adam optimizer exhibits rapid convergence, it tends to converge to sharp minima that compromise generalization performance. To address this limitation, this work proposes Inverse Adam (InvAdam), which enhances the optimizer’s ability to escape sharp minima by element-wise multiplying—rather than dividing—the first- and second-order moments. The dynamical behavior of InvAdam is analyzed through the lens of diffusion theory. Building upon this insight, the authors further integrate Adam and InvAdam into a unified framework termed DualAdam, which preserves fast convergence while substantially improving generalization. Empirical evaluations demonstrate that DualAdam consistently outperforms Adam and its state-of-the-art variants across both image classification tasks and fine-tuning of large language models.

📝 Abstract
In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for this generalization deficiency is that Adam often converges to sharp minima. To enhance its ability to find flat minima, we propose a new variant named inverse Adam (InvAdam). The key improvement of InvAdam lies in its parameter update mechanism, which is the opposite of Adam's: it computes the element-wise multiplication of the first-order and second-order moments, whereas Adam computes the element-wise division of these two moments. This modification increases the step size of the parameter update when the elements of the second-order moment are large, and decreases it when they are small, which helps the parameters escape sharp minima and settle in flat ones. However, InvAdam's update mechanism may face challenges in convergence. To address this challenge, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization performance. Additionally, we introduce diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments are conducted on image classification tasks and large language model (LLM) fine-tuning. The results validate that DualAdam outperforms Adam and its state-of-the-art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin-lab/DualAdam.
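The abstract specifies the qualitative update rules but not the exact formulas, so the following is a minimal NumPy sketch of what a DualAdam-style step could look like under stated assumptions: standard Adam moment estimates with bias correction, an InvAdam direction that multiplies the first moment by the root of the second moment instead of dividing by it, and a hypothetical mixing weight `alpha` blending the two directions (the paper's actual combination scheme may differ).

```python
import numpy as np

def dualadam_step(theta, grad, m, v, t, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, alpha=0.5):
    """Sketch of one DualAdam-style parameter update.

    Assumptions (not taken from the paper's equations, which are not
    shown on this page): standard exponential moving averages with bias
    correction, and a convex combination of the Adam and InvAdam
    directions via the hypothetical weight `alpha`.
    """
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected moment estimates (t is the 1-based step count).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Adam: divide by sqrt of the second moment (small steps in
    # high-variance directions).
    adam_dir = m_hat / (np.sqrt(v_hat) + eps)
    # InvAdam: multiply by sqrt of the second moment (larger steps
    # where the second moment is large, aiding escape from sharp minima).
    inv_dir = m_hat * np.sqrt(v_hat)

    theta = theta - lr * ((1 - alpha) * adam_dir + alpha * inv_dir)
    return theta, m, v
```

With `alpha=0` this reduces to vanilla Adam, and with `alpha=1` to the pure InvAdam direction, which matches the abstract's framing of DualAdam as an integration of the two mechanisms.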
Problem

Research questions and friction points this paper is trying to address.

generalization
sharp minima
Adam optimizer
deep learning
flat minima
Innovation

Methods, ideas, or system contributions that make the work stand out.

InvAdam
DualAdam
flat minima
generalization
adaptive optimization
Tao Shi
School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
Liangming Chen
Associate Professor of Systems and Control, Southern University of Science and Technology, China
Systems and Control, Rigidity graph theory, Multi-agent systems, Formation control, Network localization
Long Jin
Lanzhou University
Neural Networks, Robotics, Distributed Coordination, Neural Dynamics
Mengchu Zhou
Helen and John C. Hartmann Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA