AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

📅 2023-01-17
🏛️ IEEE Transactions on Parallel and Distributed Systems
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address the complexity of manual parallel-strategy design and the difficulty of optimizing communication overhead in large-scale distributed training, this paper proposes the first automatic parallelism search framework built on OneFlow's SBP (Split, Broadcast, Partial Sum) abstraction. The method combines analytical models of communication and computation performance with a customized coordinate descent algorithm, which sharply reduces search cost while guaranteeing near-optimal bandwidth cost. It supports automatic generation and efficient execution of hybrid parallel strategies spanning data, operator, and pipeline parallelism. Experiments on multi-node single-GPU and multi-node multi-GPU platforms show end-to-end training-time reductions of up to 31.1% and 10% for Transformer, and up to 17.7% and 71.5% for VGG, respectively. The core contribution is a lightweight, SBP-driven automatic parallelism paradigm that balances theoretical interpretability with practical engineering efficiency.
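The SBP semantics behind the framework can be illustrated with a toy matrix multiply: splitting both operands along the reduction axis gives each device a full-shaped but Partial-Sum result, and reducing the partials recovers the Broadcast (replicated) output. The sketch below is a pure-Python illustration of these semantics, not OneFlow's actual API:

```python
# Toy illustration of SBP (Split / Broadcast / Partial-Sum) semantics for
# Y = X @ W sharded across 2 logical devices. Pure Python, not OneFlow code.

def matmul(A, B):
    """Plain dense matmul over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2x4 input
W = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4x2 weight

# Split X on dim 1 and W on dim 0 -- both along the reduction axis:
X0, X1 = [r[:2] for r in X], [r[2:] for r in X]
W0, W1 = W[:2], W[2:]

# Each "device" computes a full-shaped but Partial-Sum result.
P0, P1 = matmul(X0, W0), matmul(X1, W1)

# Summing the partials (an allreduce in a real system) yields the Broadcast Y.
Y = [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(P0, P1)]
assert Y == matmul(X, W)
```

Converting a Partial-Sum tensor to Broadcast is exactly the kind of collective communication (here, an allreduce) that an analytical bandwidth model must price when comparing parallelization schemes.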
📝 Abstract
Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial Sum (SBP) abstraction. AutoDDL is equipped with an analytical performance model combined with a customized Coordinate Descent algorithm, which significantly reduces the scheme searching overhead. We conduct evaluations on Multi-Node-Single-GPU and Multi-Node-Multi-GPU machines using different models, including VGG and Transformer. Compared to the expert-optimized implementations, AutoDDL reduces the end-to-end training time by up to 31.1% and 10% for Transformer and up to 17.7% and 71.5% for VGG on the two parallel systems, respectively.
Problem

Research questions and friction points this paper is trying to address.

Automating distributed deep learning parallelization schemes
Reducing bandwidth cost in distributed training systems
Optimizing performance with efficient scheme search algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated exploration of parallelization schemes with near-optimal bandwidth cost
Utilizes SBP abstraction for scheme description and implementation
Combines analytical model with Coordinate Descent for efficient search
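As a concrete sketch of the search idea, the hypothetical snippet below runs coordinate descent over (data, operator, pipeline) parallel degrees against a toy analytical communication model. The cost terms, constants, and function names are illustrative assumptions, not AutoDDL's actual model:

```python
# Hypothetical coordinate-descent search over a (data, operator, pipeline)
# degree assignment; the analytical cost terms are toy stand-ins.

def divisors(n):
    return [k for k in range(1, n + 1) if n % k == 0]

def analytical_cost(scheme, num_devices, params=1e8, activations=1e7, layers=24):
    """Toy communication-volume model for a (d, t, p) parallel degree triple."""
    d, t, p = scheme
    if d * t * p != num_devices:
        return float("inf")  # infeasible: degrees must multiply to device count
    grad_comm = 2 * (d - 1) / d * (params / (t * p))         # gradient ring-allreduce
    act_comm = 2 * (t - 1) / t * activations * (layers // p)  # per-layer activation allreduce
    pipe_comm = (p - 1) * activations                         # point-to-point stage transfers
    return grad_comm + act_comm + pipe_comm

def coordinate_descent(num_devices=8):
    """Sweep one degree at a time, rebalancing a neighbor to stay feasible."""
    scheme = [num_devices, 1, 1]  # start from pure data parallelism
    best = analytical_cost(tuple(scheme), num_devices)
    improved = True
    while improved:
        improved = False
        for axis in range(3):
            other, fixed = (axis + 1) % 3, (axis + 2) % 3
            for v in divisors(num_devices):
                rest = num_devices // v
                if rest % scheme[fixed]:
                    continue  # cannot absorb the leftover factor on `other`
                trial = scheme.copy()
                trial[axis], trial[other] = v, rest // scheme[fixed]
                cost = analytical_cost(tuple(trial), num_devices)
                if cost < best:
                    best, scheme, improved = cost, trial, True
    return tuple(scheme), best

scheme, cost = coordinate_descent(8)
```

Each pass costs only a handful of model evaluations per axis, which is the source of the low search overhead compared to enumerating every feasible scheme.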