Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
During large language model training, optimization techniques such as virtual pipelining and activation recomputation disrupt tensor lifetimes, causing severe GPU memory fragmentation that wastes memory, triggers out-of-memory (OOM) errors, and can render memory-intensive techniques unusable. To address this, the authors propose STWeaver, a memory allocator built on spatio-temporal co-planning: it analyzes tensor lifetime patterns and spatial distribution characteristics offline to generate a near-optimal allocation plan, and adapts online to both dense and sparse (e.g., MoE) model workloads. Its plug-and-play design enables low-overhead integration into PyTorch. Experiments show that STWeaver reduces memory fragmentation by 79.2% on average (up to 100%) and improves training throughput by up to 32.5%, significantly enhancing memory efficiency and scalability for large-model training.

📝 Abstract
The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable. To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2% (up to 100%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5%.
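The offline-planning idea described in the abstract, placing tensors with known lifetimes so that temporally overlapping tensors never share addresses, can be illustrated with a minimal sketch. This is not STWeaver's actual algorithm (the paper's planner is near-optimal and handles far more structure); it is a simple greedy first-fit interval packing over a (address, time) plane, with all names hypothetical:

```python
# Illustrative sketch only, not STWeaver's planner: offline placement of
# tensors with known lifetimes via greedy first-fit in address space.
# Tensors whose lifetimes overlap must occupy disjoint address ranges;
# tensors with disjoint lifetimes may reuse the same addresses.

def plan_offsets(tensors):
    """tensors: list of (size, start_step, end_step), half-open lifetimes.

    Returns (offsets, peak): a byte offset per tensor (in input-sorted
    order by start_step) and the resulting peak memory footprint.
    """
    placed = []   # (offset, size, start, end) for already-placed tensors
    offsets = []
    for size, start, end in sorted(tensors, key=lambda t: t[1]):
        # Address ranges held by tensors alive at some point in [start, end)
        busy = sorted((off, off + sz) for off, sz, s, e in placed
                      if not (e <= start or end <= s))
        # First-fit: lowest offset with `size` contiguous free bytes
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        offsets.append(offset)
    peak = max((off + sz for off, sz, _, _ in placed), default=0)
    return offsets, peak

# Three 100-byte tensors; the third reuses the first tensor's range
# because their lifetimes do not overlap, so peak is 200 rather than 300.
offs, peak = plan_offsets([(100, 0, 2), (100, 1, 3), (100, 2, 4)])
```

With full lifetime knowledge, the planner achieves a 200-byte peak where a naive no-reuse scheme would reserve 300 bytes; online allocators without lifetime knowledge approximate this imperfectly, which is the fragmentation gap the paper targets.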
Problem

Research questions and friction points this paper is trying to address.

Reduces GPU memory fragmentation in large-scale model training
Optimizes tensor allocation via spatio-temporal planning
Enables efficient training with virtual pipeline and recomputation
Innovation

Methods, ideas, or system contributions that make the work stand out.

STWeaver combines offline planning with online allocation
Leverages spatio-temporal regularities for near-optimal allocation plans
Reduces GPU memory fragmentation by 79.2% on average
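To make the fragmentation figure concrete: one common definition of fragmentation ratio (assumed here for illustration; the paper may define the metric differently) is the fraction of reserved GPU memory that holds no live tensor data at the moment of peak demand:

```python
# Assumed illustrative metric, not necessarily the paper's exact definition:
# fragmentation ratio = unusable reserved memory / total reserved memory.

def fragmentation_ratio(reserved_bytes, active_bytes):
    """Fraction of reserved GPU memory not occupied by live tensors."""
    return (reserved_bytes - active_bytes) / reserved_bytes

# e.g. 40 GiB reserved by the caching allocator, 28 GiB of live tensors:
# 30% of the reservation is fragmented free space.
ratio = fragmentation_ratio(40 * 2**30, 28 * 2**30)
```

Under a metric like this, the paper's reported 79.2% average reduction means the fragmented fraction shrinks to roughly a fifth of its original value, and "up to 100%" means fragmentation is eliminated entirely in the best case.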
👥 Authors
Zixiao Huang (Tsinghua University)
Junhao Hu (Infinigence AI)
Hao Lin (Tsinghua University)
Chunyang Zhu (Infinigence AI)
Yueran Tang (Infinigence AI)
Quanlu Zhang (Infinigence AI)
Zhen Guo (Infinigence AI)
Zhenhua Li (Tsinghua University)
Shengen Yan (Department of Electronic Engineering, Tsinghua University, China; Large Scale Deep Learning, Heterogeneous Computing)
Zhenhua Zhu (Tsinghua University)
Guohao Dai (Associate Professor, Shanghai Jiao Tong University; Sparse Computation, Large-scale Graph Processing, FPGA, Circuits and Systems)
Yu Wang (Tsinghua University)