AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Adam is widely adopted in LLM pretraining and post-training, but its second-moment estimation incurs substantial memory and computational overhead. Method: We propose AdamS—a lightweight adaptive optimizer that replaces the second moment with the square root of a weighted sum of squared gradients, retaining Adam’s convergence properties while matching SGD-M’s efficiency. Leveraging an (L₀, L₁)-smoothness analysis, we theoretically establish that the momentum norm quantifies local smoothness in Transformers, enabling a second-moment-free adaptive normalization mechanism with provable convergence guarantees that requires no code changes. AdamS is fully compatible with AdamW hyperparameters and requires no API or architecture modifications. Contribution/Results: Empirical evaluation on GPT-2 and Llama2 (up to 13B parameters) demonstrates that AdamS achieves faster convergence and superior generalization in both pretraining and RLHF-based post-training, while matching SGD-M’s memory and compute costs.

📝 Abstract
We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the square root of a weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit the hyperparameters of AdamW, and it is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties of transformer objectives, where local smoothness is governed by gradient magnitudes, which can in turn be approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance across various tasks, including pretraining runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
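The update described in the abstract can be sketched in a few lines: momentum is maintained as in Adam, but the per-coordinate denominator is the square root of a weighted combination of the squared momentum and the squared current gradient, so no second-moment buffer is stored. This is a minimal NumPy sketch, not the authors' implementation; the particular (β₂, 1−β₂) weighting and the decoupled weight decay are assumptions for illustration, since the abstract only specifies the general form of the denominator.

```python
import numpy as np

def adams_step(theta, m, g, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamS-style update (sketch).

    Momentum is normalized by the square root of a weighted sum of the
    squared momentum and the squared current gradient, replacing Adam's
    second-moment estimate.
    """
    # First moment (momentum), as in Adam / SGD with momentum.
    m = beta1 * m + (1.0 - beta1) * g
    # Second-moment-free denominator. The (beta2, 1 - beta2) weighting
    # is an assumed choice; the abstract says only "a weighted sum of
    # squares of the momentum and the current gradient".
    denom = np.sqrt(beta2 * m**2 + (1.0 - beta2) * g**2) + eps
    # Decoupled weight decay, following AdamW conventions (assumed here).
    theta = theta - lr * (m / denom + weight_decay * theta)
    return theta, m
```

Note that only `theta` and `m` are carried between steps, matching the memory footprint of SGD with momentum, whereas Adam would additionally store a second-moment buffer of the same shape as `theta`.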
Problem

Research questions and friction points this paper is trying to address.

Replacing second-moment estimates with momentum for LLM optimization
Improving efficiency and performance in pretraining and post-training of LLMs
Providing a model-agnostic optimizer compatible with existing pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses momentum as normalizer for LLM optimization
Eliminates need for second-moment estimates
Seamlessly integrates with existing pipelines