🤖 AI Summary
Stochastic momentum methods under the LMO (Linear Minimization Oracle) framework converge at a rate no better than $O(1/K^{1/4})$, and existing Hessian-Corrected Momentum (HCM) approaches are restricted to the Euclidean norm and strong smoothness assumptions.
Method: This work generalizes HCM to arbitrary norms and relaxed smoothness conditions by designing a non-Euclidean geometrically adaptive HCM update rule.
Contribution/Results: The paper establishes an $O(1/K^{1/3})$ convergence rate for HCM in the arbitrary-norm, relaxed-smoothness setting, improving on the $O(1/K^{1/4})$ rate of standard stochastic momentum without requiring strong smoothness. The analysis accommodates broader constraint structures and geometry-aware adaptation. Experiments on MLPs and LSTMs show that the proposed method outperforms state-of-the-art baselines, including Muon and Scion, in both training stability and convergence speed.
📝 Abstract
The use of momentum in stochastic optimization algorithms has shown empirical success across a range of machine learning tasks. Recently, a new class of stochastic momentum algorithms has emerged within the Linear Minimization Oracle (LMO) framework, leading to state-of-the-art methods, such as Muon, Scion, and Gluon, that effectively solve deep neural network training problems. However, traditional stochastic momentum methods offer convergence guarantees no better than the $O(1/K^{1/4})$ rate. While several approaches, such as Hessian-Corrected Momentum (HCM), have aimed to improve this rate, their theoretical results are generally restricted to the Euclidean norm setting. This limitation hinders their applicability to problems where arbitrary norms are required. In this paper, we extend the LMO-based framework by integrating HCM, and provide convergence guarantees under relaxed smoothness and arbitrary norm settings. We establish improved convergence rates of $O(1/K^{1/3})$ for HCM, which can adapt to the geometry of the problem and achieve a faster rate than traditional momentum. Experimental results on training Multi-Layer Perceptrons (MLPs) and Long Short-Term Memory (LSTM) networks verify our theoretical observations.
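To make the two ingredients concrete, the following is a minimal sketch (not the paper's exact algorithm) of a Hessian-corrected momentum estimator combined with an LMO step, on a toy quadratic where the Hessian-vector product is exact. The momentum form $m_k = (1-\alpha)\big(m_{k-1} + H_k(x_k - x_{k-1})\big) + \alpha g_k$ and the $\ell_\infty$-ball LMO (a sign-based update, in the spirit of non-Euclidean methods like Scion) are illustrative choices; all variable names and hyperparameters here are assumptions.

```python
import numpy as np

# Toy problem: f(x) = 0.5 x^T H x - b^T x with a fixed PSD Hessian H,
# so the Hessian-vector product needed by HCM is exact and cheap.
rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
H = A.T @ A / n + np.eye(n)        # PSD Hessian
b = rng.standard_normal(n)

def grad(x, noise=0.1):
    """Stochastic gradient of f at x (additive Gaussian noise)."""
    return H @ x - b + noise * rng.standard_normal(n)

def lmo_linf(m):
    """LMO over the unit l_inf ball: argmin_{||d||_inf <= 1} <m, d> = -sign(m)."""
    return -np.sign(m)

alpha, radius = 0.1, 0.05          # momentum mixing weight and step radius
x = np.zeros(n)
x_prev = x.copy()
m = grad(x)                        # initialize momentum with one gradient sample
for k in range(500):
    g = grad(x)
    # Hessian-corrected momentum: the HVP term H @ (x - x_prev) compensates
    # for the movement of the iterates between momentum updates.
    m = (1 - alpha) * (m + H @ (x - x_prev)) + alpha * g
    x_prev = x.copy()
    # LMO-based step: move a (decaying) radius along the LMO direction.
    x = x + (radius / (k + 1) ** 0.5) * lmo_linf(m)

x_star = np.linalg.solve(H, b)     # exact minimizer, for checking progress
print(np.linalg.norm(x - x_star))
```

Swapping `lmo_linf` for the LMO of a different norm ball (e.g. spectral-norm balls for weight matrices, as in Muon-style methods) changes only the oracle, which is exactly the geometric flexibility the paper's arbitrary-norm analysis targets.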