On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the overreliance of large language model training on complex adaptive optimizers and the lack of efficient, lightweight alternatives. The authors propose Magma, a plug-and-play gradient masking mechanism that dynamically modulates parameter updates by aligning momentum with gradients, thereby introducing curvature-aware geometric regularization into the optimization process. Built upon the RMSProp framework, Magma demonstrates substantial improvements in pretraining 1B-scale language models, achieving perplexity reductions of over 19% and 9% compared to Adam and Muon, respectively, while incurring negligible computational overhead.

📝 Abstract
Training large language models (LLMs) relies almost exclusively on dense adaptive optimizers with increasingly sophisticated preconditioners. We challenge this by showing that randomly masking parameter updates can be highly effective, with a masked variant of RMSProp consistently outperforming recent state-of-the-art optimizers. Our analysis reveals that the random masking induces a curvature-dependent geometric regularization that smooths the optimization trajectory. Motivated by this finding, we introduce Momentum-aligned gradient masking (Magma), which modulates the masked updates using momentum-gradient alignment. Extensive LLM pre-training experiments show that Magma is a simple drop-in replacement for adaptive optimizers with consistent gains and negligible computational overhead. Notably, for the 1B model size, Magma reduces perplexity by over 19% and 9% compared to Adam and Muon, respectively.
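The abstract describes Magma as a masked variant of RMSProp whose random update mask is modulated by momentum-gradient alignment. Below is a minimal NumPy sketch of that idea, under stated assumptions: the function name `magma_step`, the sign-based elementwise alignment score, and the keep-probability rule are all illustrative guesses, not the paper's exact formulation.

```python
import numpy as np

def magma_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, base_keep=0.5, rng=None):
    """One RMSProp-style step with a random update mask whose
    keep-probability is raised where momentum and gradient agree
    (a sketch of the abstract's idea; the exact rule is assumed)."""
    rng = rng or np.random.default_rng()
    m, v = state["m"], state["v"]
    # Momentum (EMA of gradients) and RMSProp second-moment EMA.
    m[:] = beta1 * m + (1.0 - beta1) * grad
    v[:] = beta2 * v + (1.0 - beta2) * grad ** 2
    # Elementwise momentum-gradient alignment in {-1, 0, +1} (assumed form).
    align = np.sign(m * grad)
    # Keep an update coordinate more often when momentum agrees with the
    # gradient: keep_prob is 0.75 / 0.5 / 0.25 for align = +1 / 0 / -1.
    keep_prob = np.clip(base_keep * (1.0 + 0.5 * align), 0.0, 1.0)
    mask = (rng.random(param.shape) < keep_prob).astype(param.dtype)
    # Masked RMSProp update (in place).
    param -= lr * mask * grad / (np.sqrt(v) + eps)
    return param
```

On a toy quadratic this behaves like ordinary RMSProp with roughly half the coordinates updated per step, biased toward coordinates where the descent direction is consistent over time, which is the smoothing effect the abstract attributes to masking.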
Problem

Research questions and friction points this paper is trying to address.

adaptive optimizers
large language models
parameter updates
optimization trajectory
preconditioners
Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient masking
adaptive optimizers
geometric regularization
momentum alignment
large language models