FOAM: Blocked State Folding for Memory-Efficient LLM Training

📅 2025-12-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the excessive GPU memory overhead of Adam-style optimizers in large language model (LLM) training, which often exceeds 70% of total training memory, this paper proposes FOAM (Folded Optimizer with Approximate Moment), a memory-efficient optimizer based on block-wise gradient-mean compression with residual correction. Its core innovation is a block-state folding mechanism: optimizer states are compressed by averaging within each parameter block, while a dynamic residual correction preserves critical per-parameter gradient information, avoiding computationally expensive or performance-degrading techniques such as SVD, projection, and parameter freezing. The authors prove that FOAM matches Adam's convergence rate under standard non-convex assumptions. On mainstream LLM training workloads, FOAM reduces optimizer-state memory by up to 90%, cuts total GPU memory consumption by roughly 50%, and matches or exceeds the training throughput of full-rank Adam and existing memory-efficient baselines.
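The claim that optimizer state often exceeds 70% of training memory can be sanity-checked with simple per-parameter byte accounting. The layout below (bf16 weights and gradients, fp32 master weights and Adam moments) is a common mixed-precision setup assumed for illustration, not a measurement from the paper:

```python
# Back-of-envelope per-parameter memory for mixed-precision Adam training.
# Byte counts are illustrative assumptions, not figures from the paper.
bf16 = 2  # bytes per value
fp32 = 4

weights = bf16
grads = bf16
master_weights = fp32            # fp32 copy, commonly counted as optimizer state
adam_m, adam_v = fp32, fp32      # first and second moments

optimizer_state = master_weights + adam_m + adam_v   # 12 bytes/param
total = weights + grads + optimizer_state            # 16 bytes/param

print(optimizer_state / total)   # 0.75, i.e. ~75% of memory in optimizer state

# Folding both moments to one scalar per block of, say, 64 parameters
# shrinks the moment memory by the block size:
folded_moments = (adam_m + adam_v) / 64
print(1 - folded_moments / (adam_m + adam_v))  # 0.984375 moment-memory reduction
```

Exact savings depend on the framework's precision policy and whether master weights are counted as optimizer state, but the qualitative picture matches the summary: moments dominate, so folding them is where the leverage is.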

📝 Abstract
Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
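The block-wise folding with residual correction described in the abstract can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the function name `folded_adam_step`, the block size, and the exact form of the residual correction are hypothetical, not the paper's algorithm.

```python
import numpy as np

def folded_adam_step(param, grad, m_blk, v_blk, t,
                     block=64, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One step of a block-folded Adam sketch (hypothetical, for illustration).

    Moments are stored per *block* (one scalar per `block` parameters)
    instead of per parameter, and an on-the-fly residual term re-injects
    the per-parameter detail lost by the block averaging.
    Assumes param.size is divisible by `block`.
    """
    g = grad.reshape(-1, block)        # group parameters into blocks
    g_mean = g.mean(axis=1)            # block-wise gradient mean
    res = g - g_mean[:, None]          # per-parameter residual within each block

    # Adam-style moment updates on the compressed (block-mean) statistics.
    m_blk = b1 * m_blk + (1 - b1) * g_mean
    v_blk = b2 * v_blk + (1 - b2) * g_mean ** 2

    m_hat = m_blk / (1 - b1 ** t)      # bias correction
    v_hat = v_blk / (1 - b2 ** t)

    # Block-level Adam direction broadcast back to parameters, plus a
    # residual correction restoring intra-block gradient detail.
    step_blk = m_hat / (np.sqrt(v_hat) + eps)
    update = step_blk[:, None] + res / (np.sqrt(v_hat)[:, None] + eps)

    param = param - lr * update.reshape(param.shape)
    return param, m_blk, v_blk
```

Memory-wise, the point is that `m_blk` and `v_blk` hold `param.size / block` scalars each instead of `param.size`, so moment storage shrinks by the block-size factor while the residual is recomputed from the current gradient rather than stored.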
Problem

Research questions and friction points this paper is trying to address.

Optimizer states of Adam-style methods dominate GPU memory in LLM training
Existing remedies (SVD, projection, weight freezing) add compute overhead or degrade performance
How to compress optimizer states without sacrificing convergence rate or throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compresses optimizer states via block-wise gradient means
Incorporates residual correction to recover lost information
Reduces training memory by 50% and optimizer state by 90%
Ziqing Wen
National University of Defense Technology
Optimization, Machine learning theory

Jiahuan Wang
National University of Defense Technology
Machine Learning

Ping Luo
National University of Defense Technology
Distributed Computing

Dongsheng Li
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China

Tao Sun
National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology, Changsha, Hunan, China