Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
The high memory overhead of optimizers like Adam severely limits the scalability of large language model (LLM) training. Method: This paper introduces Gradient Wavelet Transform (GWT), the first approach to apply wavelet analysis to gradient compression, departing from conventional low-rank approximation paradigms. GWT achieves efficient gradient sparsification and optimizer-state compression while preserving the update rank, and is fully compatible with standard optimizers (e.g., Adam) without modifying training procedures or model architectures; it natively supports distributed training. Contribution/Results: On both pre-training and fine-tuning tasks, GWT reduces GPU memory consumption by 42%–68% relative to full-precision Adam while maintaining equivalent model performance. It significantly outperforms existing memory-efficient optimizers, achieving state-of-the-art (SOTA) results in the memory-accuracy trade-off.

📝 Abstract
Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate memory-efficient methods beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
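The abstract describes the core idea: transform gradients into wavelet space, keep Adam's moment buffers there (where the representation is smaller), and inverse-transform the resulting update. The following is a minimal sketch of that pattern using a single-level Haar low-pass transform in NumPy; the function names, the choice of a Haar filter, and the single-matrix Adam step are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def haar_forward(G):
    # Single-level Haar low-pass: pairwise averages along the last axis.
    # Halves the number of columns, so optimizer states stored in this
    # space cost half the memory of full-gradient states.
    return (G[:, 0::2] + G[:, 1::2]) / np.sqrt(2)

def haar_inverse(C):
    # Map low-frequency coefficients back to the original width by
    # duplicating each coefficient (the adjoint of the low-pass filter).
    return np.repeat(C, 2, axis=1) / np.sqrt(2)

def adam_step_gwt(W, G, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    # Compress the gradient, update Adam moments in wavelet space,
    # then inverse-transform the preconditioned update to full size.
    C = haar_forward(G)
    m = b1 * m + (1 - b1) * C
    v = b2 * v + (1 - b2) * C ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    update = haar_inverse(m_hat / (np.sqrt(v_hat) + eps))
    return W - lr * update, m, v
```

Note that the moment buffers `m` and `v` have half the columns of the weight matrix, which is the source of the memory saving, while the applied update is still full-sized (full-rank), unlike low-rank projection methods.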
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Memory Consumption
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient Wavelet Transform
Memory Efficiency
Adam Optimizer
Ziqing Wen
National University of Defense Technology
Optimization; Machine learning theory
Ping Luo
National University of Defense Technology
Distributed computing
Jiahuan Wang
National University of Defense Technology
Machine Learning
Xiaoge Deng
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China.
Jinping Zou
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China.
Kun Yuan
Center for Machine Learning Research, Peking University, Beijing, China.
Tao Sun
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China.
Dongsheng Li
College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China.