AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

📅 2023-01-17
🏛️ IEEE Transactions on Parallel and Distributed Systems
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address the complexity of manual parallel-strategy design and the difficulty of optimizing communication overhead in large-scale distributed training, this paper proposes the first automatic parallelism search framework built on OneFlow's SBP (Split, Broadcast, Partial Sum) abstraction. The method combines analytical models of communication and computation performance with a customized coordinate descent algorithm, which sharply reduces search cost while guaranteeing near-optimal bandwidth cost. It supports automatic generation and efficient execution of hybrid parallel strategies spanning data, operator, and pipeline parallelism. Experiments on multi-node single-GPU and multi-node multi-GPU platforms show end-to-end training-time reductions of up to 31.1% and 10% for Transformer, and up to 17.7% and 71.5% for VGG, respectively. The core contribution is a lightweight, SBP-driven automatic parallelism paradigm that balances theoretical interpretability with practical engineering efficiency.
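The SBP semantics behind the framework can be illustrated with a toy matrix multiply: splitting both operands along the reduction axis gives each device a full-shaped but Partial-Sum result, and reducing the partials recovers the Broadcast (replicated) output. The sketch below is a pure-Python illustration of these semantics, not OneFlow's actual API:

```python
# Toy illustration of SBP (Split / Broadcast / Partial-Sum) semantics for
# Y = X @ W sharded across 2 logical devices. Pure Python, not OneFlow code.

def matmul(A, B):
    """Plain dense matmul over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2x4 input
W = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4x2 weight

# Split X on dim 1 and W on dim 0 -- both along the reduction axis:
X0, X1 = [r[:2] for r in X], [r[2:] for r in X]
W0, W1 = W[:2], W[2:]

# Each "device" computes a full-shaped but Partial-Sum result.
P0, P1 = matmul(X0, W0), matmul(X1, W1)

# Summing the partials (an allreduce in a real system) yields the Broadcast Y.
Y = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(P0, P1)]
assert Y == matmul(X, W)
```

Converting a Partial-Sum tensor to Broadcast is exactly the kind of collective communication (here, an allreduce) that an analytical bandwidth model must price when comparing parallelization schemes.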
📝 Abstract
Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the scheme searching overhead. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to the expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
Problem

Research questions and friction points this paper is trying to address.

Automating distributed deep learning parallelization schemes
Reducing bandwidth cost in distributed training systems
Optimizing performance with efficient scheme search algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated exploration of parallelization schemes with near-optimal bandwidth cost
Utilizes SBP abstraction for scheme description and implementation
Combines analytical model with Coordinate Descent for efficient search
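As a concrete sketch of the search idea, the hypothetical snippet below runs coordinate descent over (data, operator, pipeline) parallel degrees against a toy analytical communication model. The cost terms, constants, and function names are illustrative assumptions, not AutoDDL's actual model:

```python
# Hypothetical coordinate-descent search over a (data, operator, pipeline)
# degree assignment; the analytical cost terms are toy stand-ins.

def divisors(n):
    return [k for k in range(1, n + 1) if n % k == 0]

def analytical_cost(scheme, num_devices, params=1e8, activations=1e7, layers=24):
    """Toy communication-volume model for a (d, t, p) parallel degree triple."""
    d, t, p = scheme
    if d * t * p != num_devices:
        return float("inf")  # infeasible: degrees must multiply to device count
    grad_comm = 2 * (d - 1) / d * (params / (t * p))         # gradient ring-allreduce
    act_comm = 2 * (t - 1) / t * activations * (layers // p)  # per-layer activation allreduce
    pipe_comm = (p - 1) * activations                         # point-to-point stage transfers
    return grad_comm + act_comm + pipe_comm

def coordinate_descent(num_devices=8):
    """Sweep one degree at a time, rebalancing a neighbor to stay feasible."""
    scheme = [num_devices, 1, 1]  # start from pure data parallelism
    best = analytical_cost(tuple(scheme), num_devices)
    improved = True
    while improved:
        improved = False
        for axis in range(3):
            other, fixed = (axis + 1) % 3, (axis + 2) % 3
            for v in divisors(num_devices):
                rest = num_devices // v
                if rest % scheme[fixed]:
                    continue  # cannot absorb the leftover factor on `other`
                trial = scheme.copy()
                trial[axis], trial[other] = v, rest // scheme[fixed]
                cost = analytical_cost(tuple(trial), num_devices)
                if cost < best:
                    best, scheme, improved = cost, trial, True
    return tuple(scheme), best

scheme, cost = coordinate_descent(8)
```

Each pass costs only a handful of model evaluations per axis, which is the source of the low search overhead compared to enumerating every feasible scheme.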