🤖 AI Summary
This work questions whether AdamW, the default optimizer in the reinforcement learning (RL) phase of large language model training, is worth its high memory overhead there. Within the RL-with-verifiable-rewards (RLVR) setting, the study systematically replaces AdamW with the far more memory-efficient stochastic gradient descent (SGD) and finds that SGD matches or even surpasses AdamW in performance. Remarkably, SGD naturally yields extremely sparse parameter updates (fewer than 0.02% of parameters) without any explicit sparsity-inducing regularization. These findings challenge the prevailing default in optimizer selection for RL with large models, showing that SGD can sharply reduce optimizer memory without sacrificing effectiveness and pointing to a more scalable, parameter-efficient route for RL with large language models.
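To make the two claims concrete, here is a minimal sketch (not the paper's code) of the drop-in swap and of how one could count how many parameters an optimizer step actually changes. The tiny MLP policy and REINFORCE-style loss are hypothetical stand-ins for an LLM and an RLVR objective; in this fp32 toy nearly every parameter changes, whereas the paper reports fewer than 0.02% at LLM scale.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "policy": a tiny MLP instead of an LLM.
policy = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 8))

# Plain SGD keeps no per-parameter optimizer state; AdamW keeps two extra
# tensors (first and second moments) per parameter, the source of its
# memory overhead.
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
# optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)  # usual default

before = [p.detach().clone() for p in policy.parameters()]

# Hypothetical REINFORCE-style loss: log-probs of sampled actions weighted
# by a mean-centered 0/1 reward, standing in for a verifiable reward.
obs = torch.randn(32, 64)
logits = policy(obs)
actions = torch.distributions.Categorical(logits=logits).sample()
log_probs = torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
rewards = torch.randint(0, 2, (32,)).float()
loss = -(log_probs * (rewards - rewards.mean())).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Fraction of parameters whose stored value changed after one step.
changed = sum((p != b).sum().item() for p, b in zip(policy.parameters(), before))
total = sum(p.numel() for p in policy.parameters())
print(f"updated {changed / total:.2%} of parameters")
```

Because `torch.optim.SGD` and `torch.optim.AdamW` share the same interface, the swap the paper studies is literally a one-line change; everything else in the training loop stays intact.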
📝 Abstract
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts. However, optimization practices in RL largely follow those of next-token-prediction stages such as pretraining and supervised fine-tuning (SFT), despite fundamental differences between RL and these stages highlighted by recent work. One such practice is the use of the AdamW optimizer, widely adopted for training large-scale transformers despite its high memory overhead. Our analysis shows that both the momentum and the adaptive learning rates of AdamW are less influential in RL than in SFT, leading us to hypothesize that RL benefits little from Adam-style per-parameter adaptivity and momentum. Confirming this hypothesis, our experiments demonstrate that the substantially more memory-efficient SGD, known to perform poorly in supervised learning of large-scale transformers, matches or even outperforms AdamW in RL for LLMs. Remarkably, full fine-tuning with SGD updates fewer than 0.02% of model parameters, more than 1000 times fewer than AdamW, without any sparsity-promoting regularization. Our analysis offers potential reasons for this update sparsity. These findings provide new insights into the optimization dynamics of RL in LLMs and show that RL can be substantially more parameter-efficient than previously recognized.
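For reference, these are the standard update rules being contrasted (textbook definitions, not equations taken from the paper). Plain SGD applies the raw gradient $g_t$ directly:

$$
\theta_{t+1} = \theta_t - \eta\, g_t \quad \text{(SGD)}
$$

AdamW instead maintains per-parameter first- and second-moment estimates $m_t, v_t$, which supply the momentum and adaptive learning rates the abstract refers to and are exactly the extra optimizer state behind its memory overhead:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
$$

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_t \right) \quad \text{(AdamW)},
$$

with bias-corrected $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$ and decoupled weight decay $\lambda$.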