🤖 AI Summary
Existing automatic tensor parallelism (TP) systems face scalability challenges because the search space explodes exponentially with model size and tensor dimensionality. To address this, we propose TAP, the first framework that represents the model's computational graph as a DAG and combines frequent subgraph mining with a novel graph pruning algorithm, enabling TP strategy search in sub-linear complexity. TAP further supports end-to-end compilation optimization for hybrid data and tensor parallelism. Compared to state-of-the-art automatic parallelism systems, TAP accelerates strategy search by 20–160× while achieving training throughput on par with expert manual tuning on mainstream large language models, including LLaMA and OPT, without sacrificing generality. Its design delivers both high performance and broad applicability across diverse model architectures and hardware configurations.
📝 Abstract
Model parallelism has become necessary to train large neural networks. However, finding a suitable model-parallel schedule for an arbitrary neural network is a non-trivial task due to the exploding search space. In this work, we present TAP, a model parallelism framework that automatically searches for the best data- and tensor-parallel schedules. Leveraging the key insight that a neural network can be represented as a directed acyclic graph, within which only a limited set of frequent subgraphs may exist, we design a graph pruning algorithm that folds the search space efficiently. As a result, TAP runs in sub-linear complexity with respect to the neural network size. Experiments show that TAP is $20\times$–$160\times$ faster than the state-of-the-art automatic parallelism framework, and the performance of its discovered schedules is competitive with expert-engineered ones.
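The abstract's key idea of "folding" the search space via frequent subgraphs can be illustrated with a minimal sketch. The names below (`fold_search_space`, the layer signatures, and the toy search function) are hypothetical and not TAP's actual API; the sketch only shows why reusing one searched schedule per repeated subgraph makes search cost scale with the number of *unique* subgraphs rather than the total layer count:

```python
from collections import Counter

def fold_search_space(layers, search_strategy):
    """layers: list of hashable layer signatures, e.g. ('attention', 4096).
    search_strategy: an expensive function mapping one signature to a
    parallel schedule. Returns one schedule per layer while invoking
    search_strategy only once per distinct signature."""
    cache = {}
    schedules = []
    for sig in layers:
        if sig not in cache:          # search only for unseen subgraphs
            cache[sig] = search_strategy(sig)
        schedules.append(cache[sig])  # reuse the folded result otherwise
    return schedules, len(cache)

# A 24-layer transformer-like stack with only 2 distinct layer types:
layers = [('attention', 4096), ('mlp', 4096)] * 12

calls = Counter()
def toy_search(sig):
    calls[sig] += 1
    return f"split-{sig[0]}"          # placeholder for a real TP schedule

schedules, unique = fold_search_space(layers, toy_search)
print(unique)               # 2: unique subgraphs searched, not 24
print(sum(calls.values()))  # 2: expensive search calls in total
```

Because repeated layers hash to the same signature, the expensive search runs twice instead of 24 times here, which is the intuition behind the claimed sub-linear scaling.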