Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling

πŸ“… 2024-09-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address severe resource waste and inefficient cross-task/cross-modal scheduling in distributed training of multi-task multimodal large models, this paper proposes Wavefront Scheduling, a paradigm that models execution as temporal wavefronts to uniformly characterize heterogeneous task loads and computational dependencies. The authors design a dependency-graph-driven execution engine and a heterogeneity-aware workload parallelization strategy, and build a customized distributed training runtime supporting fine-grained, dynamic, adaptive resource allocation. Evaluated on diverse multi-task multimodal models, the approach achieves up to 71% training speedup over baseline systems, reduces peak GPU memory usage and communication overhead, and consistently outperforms state-of-the-art frameworks across the reported metrics.

πŸ“ Abstract
Recent foundation models are capable of handling multiple tasks and multiple data modalities with a unified base model structure and several specialized model components. However, efficient training of such multi-task (MT), multi-modal (MM) models poses significant system challenges due to their sophisticated model architecture and the heterogeneous workloads of different tasks and modalities. In this paper, we propose Spindle, a new training system tailored for resource-efficient and high-performance training of MT MM models via wavefront scheduling. The key idea of Spindle is to decompose model execution into waves and address the joint optimization problem sequentially, covering both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. We build our system and evaluate it on various MT MM models. Experiments demonstrate the superior performance and efficiency of Spindle, with speedups of up to 71% compared to state-of-the-art training systems.
Problem

Research questions and friction points this paper is trying to address.

Efficient training of multi-task models
Handling heterogeneous workloads
Optimizing resource utilization in training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wavefront scheduling for efficiency
Heterogeneity-aware workload parallelization
Dependency-driven execution scheduling
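The core decomposition behind these ideas can be illustrated with a toy sketch. This is not the paper's actual engine; it is a minimal, assumed illustration of how a task dependency graph can be split into "waves" (level-by-level topological groups, Kahn-style), where each wave contains components whose dependencies are all satisfied by earlier waves and can thus be scheduled together. The graph names (`encoder`, `text_head`, etc.) are hypothetical.

```python
from collections import defaultdict

def wavefront_decompose(deps):
    """Group nodes of a dependency DAG into waves: wave k holds the
    nodes whose predecessors all lie in waves 0..k-1 (Kahn-style
    level-by-level topological traversal)."""
    indeg = defaultdict(int)      # number of unmet dependencies per node
    children = defaultdict(list)  # reverse edges: parent -> dependents
    nodes = set(deps)
    for node, parents in deps.items():
        for p in parents:
            children[p].append(node)
            indeg[node] += 1
            nodes.add(p)
    wave = [n for n in nodes if indeg[n] == 0]  # roots form wave 0
    waves = []
    while wave:
        waves.append(sorted(wave))
        nxt = []
        for n in wave:
            for c in children[n]:
                indeg[c] -= 1
                if indeg[c] == 0:  # all dependencies now satisfied
                    nxt.append(c)
        wave = nxt
    return waves

# Toy multi-task model: a shared encoder feeding two task-specific heads.
deps = {
    "encoder": [],
    "text_head": ["encoder"],
    "image_head": ["encoder"],
    "loss": ["text_head", "image_head"],
}
print(wavefront_decompose(deps))
# β†’ [['encoder'], ['image_head', 'text_head'], ['loss']]
```

In this sketch, the two task heads land in the same wave, which is where a heterogeneity-aware scheduler could assign them different resource shares and parallelization strategies before the next wave begins.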
Yujie Wang
Peking University
Shenhan Zhu
Peking University
Fangcheng Fu
Shanghai Jiao Tong University
machine learning, deep learning, MLSys, distributed computation
Xupeng Miao
Purdue University
Machine Learning Systems, Data Management
Jie Zhang
Alibaba Group
Juan Zhu
Alibaba Group
Fan Hong
Alibaba Group
Yong Li
Alibaba Group
Bin Cui
Peking University