DawnPiper: A Memory-Scalable Pipeline Parallel Training Framework

📅 2025-05-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the GPU memory underutilization and scalability limits caused by imbalanced memory consumption across pipeline stages in pipeline parallelism, this paper proposes a fine-grained computation-graph modeling method based on deep learning (DL) compilation. The authors derive an optimality theorem for performance-maximizing pipeline partitioning and, grounded in this theorem, design a binary-search-based pipeline partitioning algorithm. They further develop a memory-aware cost model that enables automatic, memory-sensitive operator partitioning and optimization, coupled with end-to-end automatic code generation. Compared to vPipe and PipeDream, the approach increases the maximum trainable batch size by up to 4× and 11×, respectively, and achieves up to 1.5× higher training throughput. These improvements significantly enhance the feasibility of training large-scale models under constrained hardware resources.
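The binary-search flavor of pipeline partitioning can be illustrated with the classic formulation: split a chain of layers into contiguous stages so that the bottleneck (most expensive) stage is as cheap as possible, binary-searching over the candidate bottleneck cost. This is a generic sketch of the idea, not DawnPiper's actual algorithm, and the per-layer costs below are made-up numbers.

```python
def feasible(costs, k, cap):
    """Can the layer chain be split into at most k contiguous
    stages with every stage's total cost <= cap?"""
    stages, current = 1, 0
    for c in costs:
        if c > cap:
            return False          # a single layer already exceeds the cap
        if current + c > cap:
            stages += 1           # close the current stage, start a new one
            current = c
        else:
            current += c
    return stages <= k

def partition(costs, k):
    """Binary-search the smallest achievable bottleneck stage cost."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(costs, k, mid):
            hi = mid              # cap works: try a tighter bottleneck
        else:
            lo = mid + 1          # cap too tight: relax it
    return lo

# Hypothetical per-layer costs for an 8-layer model on 4 stages.
print(partition([4, 2, 7, 1, 3, 5, 2, 6], 4))  # → 8
```

The search space collapses from exponentially many stage boundaries to O(log(sum of costs)) feasibility checks, which is the kind of reduction the paper's optimality theorem is used to justify.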

📝 Abstract
Pipeline parallelism is a crucial paradigm for large-scale model training. However, imbalances in memory footprint across stages can lead to significant GPU memory wastage, limiting the model sizes that pipeline parallelism can effectively support. In this paper, we introduce DawnPiper, a memory-scalable pipeline parallel training framework. First, we develop a DL compilation-based profiling method that transforms the model into a fine-grained computation graph. This refinement gives us a finer granularity of model partitioning and memory optimization while facilitating automatic code generation. Based on observed memory usage characteristics, we derive a performance-optimal theorem for pipeline parallel partitioning that substantially reduces the partition search space. Second, we propose a binary pipeline partitioning algorithm and utilize a cost-model-based memory optimization approach to efficiently identify a nearly optimal pipeline parallel strategy. DawnPiper achieves up to a 4x and 11x increase in trainable maximum batch size compared to vPipe and PipeDream, respectively, and provides up to a 1.5x performance speedup compared to vPipe.
Problem

Research questions and friction points this paper is trying to address.

Addresses GPU memory wastage in pipeline parallelism
Optimizes model partitioning for memory efficiency
Enhances training performance and batch size scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

DL compilation-based fine-grained computation graph profiling
Performance-optimal theorem for pipeline partitioning
Cost-model based memory optimization approach
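The cost-model based memory optimization can be sketched as a simple greedy policy (an illustration under assumed numbers, not the paper's actual model): given each operator's activation footprint and the time it would cost to recompute that activation, discard-and-recompute the activations with the lowest time overhead per byte saved until peak memory fits the stage's budget.

```python
def plan_recompute(ops, budget):
    """Greedy memory-aware planning sketch.

    ops: hypothetical map of name -> (activation_bytes, recompute_seconds),
         with activation_bytes > 0.
    Returns (chosen ops, resulting peak memory, total time overhead).
    """
    peak = sum(mem for mem, _ in ops.values())
    # Cheapest time overhead per byte of memory saved goes first.
    order = sorted(ops, key=lambda n: ops[n][1] / ops[n][0])
    chosen, overhead = [], 0.0
    for name in order:
        if peak <= budget:
            break                 # memory already fits: stop paying overhead
        mem, t = ops[name]
        chosen.append(name)
        peak -= mem
        overhead += t
    return chosen, peak, overhead

# Made-up operator profile: fit an 800-byte peak into a 500-byte budget.
ops = {"conv1": (400, 0.004), "conv2": (300, 0.006), "fc": (100, 0.001)}
print(plan_recompute(ops, 500))  # → (['conv1'], 400, 0.004)
```

A real cost model would also weigh alternatives such as swapping to host memory and account for interactions between operators, but the trade-off it navigates, memory saved versus time paid, is the same.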
Xuan Peng
The National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan, 430074, China
Xuanhua Shi
Professor of Computer Science, Huazhong University of Science and Technology, China
Computer Architecture, Computer Systems, Code Intelligence
Haolin Zhang
The National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan, 430074, China
Yunfei Zhao
Peking University
intelligent program, code generation, code representation
Xuehai Qian
Tsinghua University
Computer Architecture, Computer System