🤖 AI Summary
To address the sensitivity to learning rates, susceptibility to local optima, and poor adaptability to complex models exhibited by SGD and Adam in non-convex optimization, this paper proposes DWMGrad—a novel adaptive optimization algorithm. Its core innovation lies in a dynamic weighting mechanism grounded in gradient history, which jointly modulates the momentum coefficient and learning rate: (i) momentum is estimated via time-decayed weighted accumulation of past gradients; and (ii) step sizes are adaptively scaled using historical second-order moments, enhancing robustness to intricate loss landscapes. DWMGrad seamlessly integrates SGD’s stability with Adam’s adaptivity while introducing no additional hyperparameters. Extensive experiments across image classification and language modeling tasks demonstrate that DWMGrad accelerates convergence by an average of 18.7% and achieves higher test accuracy, validating its effectiveness and generalizability.
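The summary does not give DWMGrad's actual update equations, but the mechanism it describes — a time-decayed weighted accumulation of past gradients for momentum, plus a step size scaled by historical second-order moments — can be sketched in the style of an Adam-like update. The function name, decay coefficients, and defaults below are illustrative assumptions, not the paper's published rules:

```python
import math

def dwmgrad_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar-parameter update in the spirit of the summary:
    a time-decayed first-moment (momentum) estimate and a step size
    scaled by an accumulated second-order moment. beta1/beta2/eps are
    illustrative Adam-style defaults; DWMGrad's dynamic weighting of
    these coefficients is not specified in the abstract."""
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    # Time-decayed weighted accumulation of past gradients (first moment).
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    # Historical second-order moment used to scale the step size.
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    # Bias-corrected estimates, as in Adam.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Adaptive step: larger historical curvature -> smaller step.
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)
```

A dynamically weighted variant would make `beta1` and the effective learning rate functions of the gradient history rather than constants, which is the core change the paper claims over fixed-coefficient methods.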
📝 Abstract
Despite the widespread use of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) in deep learning, these methods remain limited in their ability to cope with fluctuating learning efficiency, the demands of complex models, and non-convex optimization problems. These limitations stem chiefly from difficulties in handling complex data structures and models: selecting an appropriate learning rate, escaping local optima, and navigating high-dimensional spaces. To address these issues, this paper introduces DWMGrad, a novel optimization algorithm. Building on traditional methods, DWMGrad incorporates a dynamic guidance mechanism that uses historical data to update momentum and learning rates on the fly, allowing the optimizer to flexibly adjust how strongly it relies on historical information across different training scenarios. This strategy not only lets the optimizer adapt to changing environments and task complexities but also, as validated through extensive experiments, enables DWMGrad to achieve faster convergence and higher accuracy in a wide range of settings.