SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Adaptive optimizers like Adam incur prohibitive memory overhead in large language model (LLM) training because they store momentum and second-moment estimator states. Method: This paper proposes a stateless SGD variant that eliminates optimizer state storage entirely. It stabilizes update scales via gradient normalization and introduces covariance-driven gradient whitening to counteract the local curvature of the loss landscape. Contribution/Results: The method's memory consumption is identical to standard SGD's, reducing end-to-end GPU memory usage by roughly 50% relative to Adam. In LLaMA-350M and 1.3B pretraining, it matches Adam's evaluation perplexity using only half as many training tokens, a 2x speedup. Crucially, it is presented as the first stateless optimizer to achieve convergence behavior and generalization performance on par with Adam in LLM training.
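The ~50% end-to-end figure can be reproduced under a simplified memory model that counts only weights, gradients, and optimizer state buffers and ignores activations. The function below is an illustrative sketch, not an accounting from the paper:

```python
def training_memory_gb(n_params, bytes_per_value=4, n_state_buffers=0):
    """Rough memory model for training: weights + gradients + optimizer state.

    Illustrative assumption: everything is stored at the same precision and
    activation memory is ignored, so the numbers are only indicative.
    """
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    states = n_params * bytes_per_value * n_state_buffers
    return (weights + grads + states) / 1e9

# Adam keeps two state buffers (momentum and second moment);
# a stateless optimizer like SWAN keeps none.
adam_gb = training_memory_gb(1.3e9, n_state_buffers=2)
swan_gb = training_memory_gb(1.3e9, n_state_buffers=0)
```

Under this model, dropping both Adam buffers removes half of the per-parameter training memory, consistent with the ~50% reduction the summary reports.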

๐Ÿ“ Abstract
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often need to maintain optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving an $\approx 50\%$ reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training the LLaMA model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.
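The pre-processing the abstract describes (normalize the instantaneous gradient, then whiten it) can be sketched as below. The row-wise standardization and the eigendecomposition-based inverse matrix square root are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def swan_like_update(G, lr=1e-3, eps=1e-8):
    """Hedged sketch of a normalize-then-whiten SGD pre-processing step.

    G: 2D gradient matrix for one weight matrix (rows x cols).
    No momentum or second-moment buffers are kept, so the step is stateless.
    """
    # Normalization: standardize each row of the gradient
    # to stabilize the update scale.
    G = (G - G.mean(axis=1, keepdims=True)) / (G.std(axis=1, keepdims=True) + eps)

    # Whitening: multiply by the inverse square root of the gradient
    # covariance, counteracting local curvature of the loss landscape.
    C = G @ G.T  # (rows x rows) covariance-like matrix
    w, V = np.linalg.eigh(C)  # C is symmetric PSD
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T
    G_white = inv_sqrt @ G

    # Stateless SGD step: the update depends only on the current gradient.
    return -lr * G_white
```

A useful sanity check on the whitening step: since $G_\text{white} = (GG^\top)^{-1/2} G$, the whitened gradient satisfies $G_\text{white} G_\text{white}^\top \approx I$, i.e. its row covariance is the identity.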
Problem

Research questions and friction points this paper is trying to address.

Reduces memory cost in LLM training
Eliminates optimizer state storage
Enhances computational efficiency and scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateless SGD optimization
Gradient normalization and whitening
Reduced memory footprint
🔎 Similar Papers
No similar papers found.