PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pipeline parallelism (PP) for large language model training suffers from explosive growth in microbatch count with increasing pipeline stages, causing severe activation memory blowup and fundamentally limiting scalability. Method: We first systematically reveal that over 50% of activations in most PP configurations can be offloaded to CPU with near-zero overhead; we then propose a selective activation offloading strategy, tightly integrating recomputation, communication scheduling, and the Zero-Bubble execution framework to jointly optimize memory footprint and throughput. Contribution: Peak per-device activation memory exhibits superlinear reduction as the number of PP stages increases; compared to tensor parallelism (TP), our approach achieves up to 19% higher training throughput while substantially reducing GPU memory consumption—effectively breaking the PP scalability bottleneck.
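The memory effect described above can be pictured with a toy model (pure Python, not the paper's implementation): in a 1F1B-style schedule the first stage keeps roughly as many microbatches in flight as there are pipeline stages, so resident activation memory grows with the stage count, and offloading a fraction of each microbatch's activations to CPU shrinks the resident footprint proportionally. The function name and the GB figures below are illustrative assumptions; the paper's selective strategy goes further and achieves better-than-linear reduction.

```python
def peak_gpu_activation(num_stages: int,
                        act_per_microbatch_gb: float,
                        offload_fraction: float = 0.0) -> float:
    """Toy estimate of peak per-device activation memory (GB).

    Assumes a 1F1B-style schedule in which the first stage holds
    ~num_stages microbatches in flight, and a uniform fraction of each
    microbatch's activations is offloaded to CPU. Illustrative only.
    """
    in_flight = num_stages
    resident_per_microbatch = act_per_microbatch_gb * (1.0 - offload_fraction)
    return in_flight * resident_per_microbatch


# Without offload, peak activation memory grows linearly with the number
# of stages; offloading half the activations (the paper reports that
# >=50% can often be offloaded with near-zero overhead) halves the
# resident footprint at every stage count.
print(peak_gpu_activation(8, 2.0))                       # -> 16.0
print(peak_gpu_activation(8, 2.0, offload_fraction=0.5)) # -> 8.0
```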

📝 Abstract
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments prove that per-device activation memory effectively decreases as the total number of stages grows, making PP a stronger alternative to TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at [this url](https://github.com/sail-sg/zero-bubble-pipeline-parallelism).
Problem

Research questions and friction points this paper is trying to address.

Activation memory grows with the number of in-flight microbatches, limiting PP scalability for LLM training.
How to reduce peak activation memory without sacrificing training throughput.
How to combine memory offload with other techniques under both throughput and memory constraints.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages the under-explored memory offload strategy in pipeline parallelism
Introduces selective offload for better-than-linear reduction of peak activation memory
Integrates offload with recomputation and scheduling to jointly optimize throughput and memory
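One way to picture a selective offload policy is as a budgeted eviction problem: given a GPU activation budget, move some activation tensors to CPU until the resident set fits. The greedy sketch below (evict the largest tensors first) is a hypothetical heuristic for illustration, not the selection criterion used in the paper.

```python
def select_offload(activation_sizes, gpu_budget):
    """Greedy sketch of selective activation offload.

    activation_sizes: per-tensor activation sizes (e.g. in GB).
    gpu_budget: memory allowed to stay resident on the GPU.
    Returns (resident, offloaded) size lists. Evicting the largest
    tensors first is one plausible heuristic; the paper's actual
    strategy also accounts for transfer overhead and scheduling.
    """
    resident = sorted(activation_sizes)   # ascending: smallest stay resident
    offloaded = []
    while resident and sum(resident) > gpu_budget:
        offloaded.append(resident.pop())  # evict the current largest tensor
    return resident, offloaded


resident, offloaded = select_offload([4.0, 1.0, 3.0, 2.0], gpu_budget=4.0)
print(resident)   # [1.0, 2.0] -> 3.0 GB stays on the GPU
print(offloaded)  # [4.0, 3.0] -> moved to CPU
```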