🤖 AI Summary
Stochastic momentum methods under the LMO (Linear Minimization Oracle) framework converge at a rate no better than $O(1/K^{1/4})$, and existing Hessian-Corrected Momentum (HCM) approaches are restricted to the Euclidean norm and strong smoothness assumptions.
Method: This work generalizes HCM to arbitrary norms and relaxed smoothness conditions by designing a non-Euclidean geometrically adaptive HCM update rule.
Contribution/Results: The paper establishes an $O(1/K^{1/3})$ convergence rate for HCM in the arbitrary-norm, relaxed-smoothness setting, improving on the $O(1/K^{1/4})$ rate of standard stochastic momentum without requiring strong smoothness. The analysis accommodates broader constraint structures and geometry-aware adaptation. Experiments on MLPs and LSTMs show that the proposed method outperforms state-of-the-art baselines, including Muon and Scion, in both training stability and convergence speed.
📝 Abstract
The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework, leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the $O(1/K^{1/4})$ rate. While several approaches, such as Hessian-Corrected Momentum (HCM), have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability to problems where arbitrary norms are required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings. We establish improved convergence rates of $O(1/K^{1/3})$ for HCM, which can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks verify our theoretical observations.
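To make the two ingredients concrete, the following is a minimal sketch (not the paper's exact algorithm) of a Hessian-corrected momentum estimator combined with an LMO step, on a toy quadratic where the Hessian-vector product is exact. The momentum form $m_k = (1-\alpha)\big(m_{k-1} + H_k(x_k - x_{k-1})\big) + \alpha g_k$ and the $\ell_\infty$-ball LMO (a sign-based update, in the spirit of non-Euclidean methods like Scion) are illustrative choices; all variable names and hyperparameters here are assumptions.

```python
import numpy as np

# Toy problem: f(x) = 0.5 x^T H x - b^T x with a fixed PSD Hessian H,
# so the Hessian-vector product needed by HCM is exact and cheap.
rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
H = A.T @ A / n + np.eye(n)        # PSD Hessian
b = rng.standard_normal(n)

def grad(x, noise=0.1):
    """Stochastic gradient of f at x (additive Gaussian noise)."""
    return H @ x - b + noise * rng.standard_normal(n)

def lmo_linf(m):
    """LMO over the unit l_inf ball: argmin_{||d||_inf <= 1} <m, d> = -sign(m)."""
    return -np.sign(m)

alpha, radius = 0.1, 0.05          # momentum mixing weight and step radius
x = np.zeros(n)
x_prev = x.copy()
m = grad(x)                        # initialize momentum with one gradient sample
for k in range(500):
    g = grad(x)
    # Hessian-corrected momentum: the HVP term H @ (x - x_prev) compensates
    # for the movement of the iterates between momentum updates.
    m = (1 - alpha) * (m + H @ (x - x_prev)) + alpha * g
    x_prev = x.copy()
    # LMO-based step: move a (decaying) radius along the LMO direction.
    x = x + (radius / (k + 1) ** 0.5) * lmo_linf(m)

x_star = np.linalg.solve(H, b)     # exact minimizer, for checking progress
print(np.linalg.norm(x - x_star))
```

Swapping `lmo_linf` for the LMO of a different norm ball (e.g. spectral-norm balls for weight matrices, as in Muon-style methods) changes only the oracle, which is exactly the geometric flexibility the paper's arbitrary-norm analysis targets.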