🤖 AI Summary
This work addresses the challenge of balancing stability and convergence speed in momentum-based optimization methods when training large-scale neural networks over complex loss landscapes. We propose a continuous-time dynamical systems framework that incorporates cubic nonlinear damping—inspired by structural dynamics—together with parameter-wise adaptive momentum and kinetic energy feedback control. This mechanism dynamically responds to local curvature, enabling an effective trade-off between stability and rapid convergence. Building upon this framework, we develop enhanced variants of momentum SGD (mSGD) and Adam equipped with cubic damping. Empirical evaluations on Vision Transformers (ViT), BERT, and GPT-2 demonstrate that our methods match or surpass standard Adam in performance, while theoretical analysis establishes their exponential convergence guarantees.
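The continuous-time dynamics described above can be pictured as a heavy-ball ODE augmented with a velocity-cubed damping term. The following is only our reading of the summary, not the paper's exact formulation; the symbols $\gamma$ and $c$ are assumed names for the linear friction and cubic damping coefficients:

```latex
\dot{x} = v, \qquad
\dot{v} = -\gamma\, v \;-\; c\, v^{\odot 3} \;-\; \nabla f(x),
```

where $v^{\odot 3}$ denotes the elementwise cube, so damping acts parameter-wise. For small velocities the cubic term is negligible and the dynamics behave like standard momentum; for large velocities it dominates and strongly suppresses oscillations, which is the stability/speed trade-off the summary describes. The "kinetic energy feedback" plausibly means each parameter's coefficient is modulated by $\tfrac{1}{2}v_i^2$.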
📝 Abstract
We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a vibration-suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods are robust and match or outperform Adam on ViT, BERT, and GPT-2 training tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.
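To make the mechanism concrete, here is a minimal discrete-time sketch of momentum SGD with an elementwise cubic damping term, run on a 1-D quadratic. This is an illustrative assumption about the update rule implied by the abstract, not the paper's actual discretization; the hyperparameter names `lr`, `beta`, and `c` are ours:

```python
def msgd_cubic_step(x, v, grad, lr=0.01, beta=0.9, c=0.1):
    """One hypothetical mSGD step with cubic damping.

    The extra -c * v**3 term mimics velocity-cubed damping from
    structural dynamics: negligible when the velocity is small, but
    strongly suppressing large velocities, which damps oscillations
    in sharply curved regions of the loss landscape.
    """
    v = beta * v - lr * grad - c * v ** 3  # cubic damping on the velocity
    x = x + v
    return x, v

# Minimize f(x) = 0.5 * x^2 (so grad = x) starting from x = 5.0.
x, v = 5.0, 0.0
for _ in range(1000):
    x, v = msgd_cubic_step(x, v, grad=x)
```

On this toy problem the iterate converges toward the minimizer at 0; the cubic term only adds dissipation on top of the usual linear momentum decay, which is consistent with the exponential convergence the abstract claims.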