Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines

📅 2021-07-14
🏛️ International Conference for High Performance Computing, Networking, Storage and Analysis
📈 Citations: 145
Influential: 18
🤖 AI Summary
To address the substantial pipeline bubble overhead, imbalanced GPU memory utilization, and limited throughput of large-scale neural network training, this paper proposes Chimera, a synchronous bidirectional pipeline parallelism scheme. Its bidirectional micro-batch scheduling reduces pipeline bubbles by up to 50% relative to the latest synchronous pipeline approach and gives a more balanced activation memory consumption; because the scheme is synchronous, it incurs no loss of accuracy. Evaluated on Transformer-based language models, including a 1.3-billion-parameter GPT-2 model on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera achieves 1.16×–2.34× higher training throughput than state-of-the-art synchronous and asynchronous pipeline approaches.
📝 Abstract
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore incurs no loss of accuracy, which makes it more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer-based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
Problem

Research questions and friction points this paper is trying to address.

Efficiently training large-scale neural networks with bidirectional pipelines
Reducing pipeline bubbles and balancing memory consumption
Improving training throughput for billion-parameter models
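
As a rough illustration of why halving the bubbles matters, the sketch below computes the fraction of one training iteration that a worker spends idle. It assumes equal-length forward and backward slots and a GPipe-style one-directional baseline with 2(D-1) bubble slots for D stages. The function names and the D = N = 4 setting are hypothetical, and the "halved" case simply applies the "up to 50%" figure from the abstract; it is not Chimera's actual schedule.

```python
from fractions import Fraction


def unidirectional_bubbles(stages: int) -> int:
    """Bubble slots per worker in a GPipe-style schedule (assumed baseline):
    (D - 1) slots while the pipeline fills plus (D - 1) while it drains."""
    return 2 * (stages - 1)


def idle_fraction(bubble_slots: int, micro_batches: int) -> Fraction:
    """Share of an iteration a worker spends idle.

    Simplifying assumption: each micro-batch costs one forward slot and one
    backward slot of equal length, so useful work is 2 * N slots.
    """
    busy_slots = 2 * micro_batches
    return Fraction(bubble_slots, busy_slots + bubble_slots)


D, N = 4, 4  # hypothetical: 4 pipeline stages, 4 micro-batches per iteration
baseline = idle_fraction(unidirectional_bubbles(D), N)
halved = idle_fraction(unidirectional_bubbles(D) // 2, N)

print(f"baseline idle fraction:       {baseline} ({float(baseline):.2f})")
print(f"with 50% fewer bubble slots:  {halved} ({float(halved):.2f})")
```

Under these assumptions the idle share drops from 3/7 to 3/11 of the iteration. Real schedules differ because backward passes are typically more expensive than forward passes.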
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional pipeline parallelism scheme
Synchronous training ensuring accuracy
Balanced activation memory consumption
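
The bidirectional idea can be pictured with a small placement sketch. It assumes two pipelines run over the same D workers in opposite directions, so that each worker hosts one stage from each pipeline. That layout is an assumption inferred from the term "bidirectional", and `bidirectional_placement` is a hypothetical helper, not part of any published implementation.

```python
from typing import Dict, Tuple


def bidirectional_placement(num_workers: int) -> Dict[int, Tuple[int, int]]:
    """Map each worker to (down_stage, up_stage).

    Assumed layout: the "down" pipeline places stage s on worker s, while the
    "up" pipeline places stage s on worker (num_workers - 1 - s).
    """
    return {w: (w, num_workers - 1 - w) for w in range(num_workers)}


placement = bidirectional_placement(4)
for worker, (down_stage, up_stage) in placement.items():
    print(f"worker {worker}: down-pipeline stage {down_stage}, "
          f"up-pipeline stage {up_stage}")
```

In this layout every worker pairs an early stage of one pipeline with a late stage of the other, which gives some intuition for how a bidirectional schedule could even out per-worker activation memory.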