Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage

๐Ÿ“… 2025-06-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address low GPU memory utilization and the memory bottlenecks caused by prolonged residency of inactive tensors during multi-GPU training of large language models (LLMs), this paper proposes a tensor-lifetime-aware GPU-SSD co-offloading framework. It models tensor activation patterns over the initial training iterations to dynamically orchestrate offloading and prefetching, and introduces a GPUDirect Storage-enabled direct tensor migration path that bypasses the CPU bottleneck. The framework further supports PyTorch compiler-level integration and coordinated scheduling across multiple GPUs and SSDs. Evaluated against ZeRO-Offload and ZeRO-Infinity, the approach achieves an average 1.47x speedup in training throughput and attains 80.7% of the ideal performance achievable without GPU memory constraints, significantly improving the cost-effectiveness of PCIe SSDs for GPU memory expansion in LLM training.

๐Ÿ“ Abstract
We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, while inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor by profiling the first few iterations of the training process. With this tensor lifetime analysis, TERAIO generates an optimized tensor offloading/prefetching plan and integrates it into the compiled LLM program via PyTorch. TERAIO has a runtime tensor migration engine that executes the offloading/prefetching plan via GPUDirect Storage, which enables direct tensor migration between GPUs and SSDs, alleviating the CPU bottleneck and maximizing SSD bandwidth utilization. In comparison with state-of-the-art studies such as ZeRO-Offload and ZeRO-Infinity, we show that TERAIO improves the training performance of various LLMs by 1.47x on average, and achieves 80.7% of the ideal performance assuming unlimited GPU memory.
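The planning step the abstract describes, estimating each tensor's idle window from profiled iterations and offloading only tensors whose window can absorb the SSD round-trip, can be sketched as follows. This is a hypothetical illustration, not TERAIO's code; the `TensorProfile` fields, bandwidth value, and threshold are assumptions for the sketch.

```python
from dataclasses import dataclass

@dataclass
class TensorProfile:
    """Per-tensor timing gathered by profiling the first few iterations (hypothetical)."""
    name: str
    size_mb: float
    last_use_ms: float   # when the tensor becomes inactive within an iteration
    next_use_ms: float   # when it is needed again (e.g., in the backward pass)

def plan_offloads(profiles, link_gbps=6.0, min_idle_ms=5.0):
    """Select tensors whose idle window is long enough to round-trip over the
    GPU<->SSD link without stalling training, and schedule the offload at the
    tensor's last use and the prefetch early enough to finish before reuse."""
    plan = []
    for p in profiles:
        idle_ms = p.next_use_ms - p.last_use_ms
        # Round-trip transfer time: offload + prefetch, size in MB over GB/s.
        xfer_ms = 2 * p.size_mb / (link_gbps * 1000) * 1000
        if idle_ms > max(min_idle_ms, xfer_ms):
            offload_at = p.last_use_ms
            prefetch_at = p.next_use_ms - xfer_ms / 2  # leave time for the read
            plan.append((p.name, offload_at, prefetch_at))
    return plan
```

Small, short-lived tensors are filtered out because their transfer cost would exceed the memory saved, matching the paper's observation that the profitable candidates are large tensors with long inactive periods.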
Problem

Research questions and friction points this paper is trying to address.

Optimizing GPU memory usage during LLM training
Reducing cost via efficient tensor offloading to SSDs
Improving training speed with lifetime-aware tensor management
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifetime-aware tensor offloading via GPUDirect Storage
Profiling-based tensor lifetime estimation for optimization
Direct GPU-SSD tensor migration for CPU bottleneck alleviation
๐Ÿ”Ž Similar Papers
No similar papers found.