🤖 AI Summary
This work addresses the training instability and performance degradation in Adam-family optimizers caused by the temporal coupling between momentum and stochastic normalization. To resolve this, we propose MVN-Grad, an optimizer that places exponential moving average (EMA)-based gradient variance normalization before momentum computation. This design decouples the two components, reduces update variance, improves robustness to gradient outliers, and prevents sign collapse under low-variance conditions. Theoretical analysis establishes the convergence and stability of MVN-Grad. Empirical evaluations on CIFAR-100 image classification and GPT-style language modeling show that MVN-Grad achieves smoother training dynamics and generalization that matches or exceeds Adam, AdaBelief, and LaProp, all without incurring additional computational overhead.
📝 Abstract
We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer that is present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize methods under standard noise assumptions, and that MVN-Grad is robust to outliers: its response to a single gradient spike is uniformly bounded. In low-variance regimes, we further show that variance normalization avoids the sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead.
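The normalize-then-momentum ordering described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a plain EMA of squared gradients as the "gradient uncertainty" estimate (the paper's exact variance estimator may differ), and default Adam-style hyperparameters.

```python
import numpy as np

def mvn_grad_step(theta, grad, m, v, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One hypothetical MVN-Grad-style update: normalize first, then momentum."""
    # EMA of squared gradients as a proxy for per-coordinate gradient uncertainty
    # (assumption: stands in for the paper's variance estimator).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Normalize BEFORE momentum, so the momentum buffer never mixes stale
    # momentum with the current stochastic normalizer.
    g_norm = grad / (np.sqrt(v) + eps)
    # Momentum is accumulated over the already-normalized gradients.
    m = beta1 * m + (1 - beta1) * g_norm
    # Plain SGD-style parameter update with the momentum of normalized grads.
    theta = theta - lr * m
    return theta, m, v

# Toy usage: minimize f(x) = ||x||^2, whose gradient is 2x.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for _ in range(2000):
    theta, m, v = mvn_grad_step(theta, 2 * theta, m, v)
```

Contrast with Adam, which accumulates momentum over raw gradients and only then divides by the second-moment estimate; there the numerator and denominator are correlated across time, which is the coupling the abstract argues against.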