SSDTrain: An Activation Offloading Framework to SSDs for Faster Large Language Model Training

📅 2024-08-19
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
GPU memory capacity has not kept pace with the growth of large language models (LLMs), and activation tensors produced in the forward pass dominate GPU memory use, forcing small micro-batch sizes and inflating training cost. This paper proposes SSDTrain, an adaptive framework that offloads activations to high-capacity NVMe SSDs. SSDTrain fully overlaps data transfers with computation so that offloading does not degrade performance, and it applies tensor deduplication and forwarding to further reduce I/O. The framework integrates with PyTorch, Megatron, and DeepSpeed without model code changes. Evaluated on GPT, BERT, and T5, SSDTrain reduces peak activation memory usage by 47% while keeping I/O overhead negligible; compared with keeping activations in GPU memory and with layerwise full recomputation, it achieves the best memory savings with negligible throughput loss.

📝 Abstract
The growth rate of the GPU memory capacity has not been able to keep up with that of the size of large language models (LLMs), hindering the model training process. In particular, activations -- the intermediate tensors produced during forward propagation and reused in backward propagation -- dominate the GPU memory use. This leads to high training overhead such as high weight update cost due to the small micro-batch size. To address this challenge, we propose SSDTrain, an adaptive activation offloading framework to high-capacity NVMe SSDs. SSDTrain reduces GPU memory usage without impacting performance by fully overlapping data transfers with computation. SSDTrain is compatible with popular deep learning frameworks like PyTorch, Megatron, and DeepSpeed, and it employs techniques such as tensor deduplication and forwarding to further enhance efficiency. We extensively experimented with popular LLMs like GPT, BERT, and T5. Results demonstrate that SSDTrain reduces 47% of the activation peak memory usage. Meanwhile, SSDTrain perfectly overlaps the I/O with the computation and incurs negligible overhead. Compared with keeping activations in GPU memory and layerwise full recomputation, SSDTrain achieves the best memory savings with negligible throughput loss. We further analyze how the reduced activation memory use may be leveraged to increase throughput by increasing micro-batch size and reducing pipeline parallelism bubbles.
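The abstract mentions tensor deduplication as one of SSDTrain's efficiency techniques. The following is a minimal, hypothetical sketch of the idea (not the paper's actual implementation): when byte-identical activation payloads would be offloaded more than once, write each unique payload to the SSD only once and let later requests reuse the existing copy. The `DedupStore` name and in-memory dict (standing in for files on an NVMe SSD) are illustrative assumptions.

```python
# Illustrative sketch of content-based tensor deduplication; the dict
# stands in for files on an NVMe SSD. Not the paper's actual code.
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}       # digest -> payload ("file on SSD")
        self.writes = 0       # counts actual SSD writes performed

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data
            self.writes += 1  # only unique payloads cost an SSD write
        return digest         # handle used later to reload the tensor

    def get(self, digest):
        return self.blobs[digest]

store = DedupStore()
h1 = store.put(b"activation-bytes")
h2 = store.put(b"activation-bytes")  # duplicate: no second write
assert h1 == h2 and store.writes == 1
```

Real activation tensors would be hashed on the host after the GPU-to-host copy; the point is that duplicate offloads collapse to a cheap lookup instead of a second SSD write.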
Problem

Research questions and friction points this paper is trying to address.

GPU memory capacity lags behind LLM sizes
Activations dominate GPU memory usage
Small micro-batch sizes inflate training overhead (e.g., weight update cost)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation offloading to NVMe SSDs
Overlapping data transfers with computation
Tensor deduplication and forwarding techniques
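The central innovation above, overlapping offload I/O with computation, can be sketched with a background writer thread: the forward pass enqueues an activation for spilling and continues immediately, and the backward pass waits for the write only when it actually needs the tensor back. This is a simplified stand-in, assuming a plain file write in place of the paper's GPU-to-host transfer plus NVMe write; the `ActivationOffloader` class and its methods are hypothetical names, not SSDTrain's API.

```python
# Minimal sketch of compute/I-O overlap for activation offloading.
# A daemon thread drains a queue of (tensor_id, bytes) jobs to disk
# while "computation" proceeds; not the paper's implementation.
import os, queue, tempfile, threading

class ActivationOffloader:
    def __init__(self, spill_dir):
        self.spill_dir = spill_dir
        self.jobs = queue.Queue()
        self.paths = {}
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def _drain(self):
        while True:
            item = self.jobs.get()
            if item is None:          # shutdown sentinel
                break
            tensor_id, data = item
            path = os.path.join(self.spill_dir, f"act_{tensor_id}.bin")
            with open(path, "wb") as f:
                f.write(data)         # in practice: GPU->host copy + NVMe write
            self.paths[tensor_id] = path
            self.jobs.task_done()

    def offload(self, tensor_id, data):
        # Enqueue and return immediately; compute overlaps the I/O.
        self.jobs.put((tensor_id, bytes(data)))

    def reload(self, tensor_id):
        self.jobs.join()              # backward pass: wait for pending writes
        with open(self.paths[tensor_id], "rb") as f:
            return f.read()

    def close(self):
        self.jobs.put(None)
        self.worker.join()

with tempfile.TemporaryDirectory() as d:
    off = ActivationOffloader(d)
    act = b"\x01\x02\x03\x04"         # stand-in for an activation tensor
    off.offload(0, act)               # forward pass: spill, don't wait
    # ... forward computation of later layers would run here ...
    restored = off.reload(0)          # backward pass: fetch it back
    off.close()
assert restored == act
```

The synchronization point (`jobs.join()`) only blocks if the write has not finished by the time the backward pass needs the tensor; when compute per layer exceeds the write time, the I/O is fully hidden, which is the regime the abstract describes.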
Kun Wu
University of Illinois at Urbana-Champaign, USA
Jeongmin Brian Park
University of Illinois at Urbana-Champaign, USA
Xiaofan Zhang
Google, USA
Mert Hidayetoğlu
Stanford University, USA
Vikram Sharma Mailthody
Nvidia, USA
Sitao Huang
Assistant Professor of EECS, University of California Irvine
Hardware Acceleration, High-Level Synthesis, FPGA, Parallel Computing, GPU
S. Lumetta
University of Illinois at Urbana-Champaign, USA
Wen-mei W. Hwu
Senior Distinguished Research Scientist, NVIDIA; Professor and Sanders-AMD Chair of Electrical and Computer Engineering
Computer Architecture, Compiler, Parallel Computing, Cognitive Computing Systems