Synergistic Tensor and Pipeline Parallelism

📅 2025-10-31
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) and multimodal LLMs (MLLMs) suffer from high tensor parallelism (TP) communication overhead and severe pipeline parallelism (PP) bubbles during distributed training. To address these issues, this paper proposes a co-optimized hybrid parallel scheduling method. Our core innovation lies in decoupling forward and backward passes into fine-grained computational units and constructing composite, cooperatively schedulable computation sequences—enabling, for the first time, joint overlap of TP communication and PP computation to eliminate both TP and PP bubbles simultaneously. The method requires no modifications to underlying communication libraries or model architectures and is fully compatible with mainstream distributed training frameworks. Experiments on representative LLM and MLLM training workloads demonstrate throughput improvements of 12% and 16%, respectively, significantly outperforming existing scheduling strategies that optimize TP or PP in isolation.

Technology Category

Application Category

📝 Abstract
In the machine learning system, the hybrid model parallelism combining tensor parallelism (TP) and pipeline parallelism (PP) has become the dominant solution for distributed training of Large Language Models~(LLMs) and Multimodal LLMs (MLLMs). However, TP introduces significant collective communication overheads, while PP suffers from synchronization inefficiencies such as pipeline bubbles. Existing works primarily address these challenges from isolated perspectives, focusing either on overlapping TP communication or on flexible PP scheduling to mitigate pipeline bubbles. In this paper, we propose a new synergistic tensor and pipeline parallelism schedule that simultaneously reduces both types of bubbles. Our proposed schedule decouples the forward and backward passes in PP into fine-grained computation units, which are then braided to form a composite computation sequence. This compositional structure enables near-complete elimination of TP-related bubbles. Building upon this structure, we further design the PP schedule to minimize PP bubbles. Experimental results demonstrate that our approach improves training throughput by up to 12% for LLMs and 16% for MLLMs compared to existing scheduling methods. Our source code is avaiable at https://github.com/MICLAB-BUPT/STP.
Problem

Research questions and friction points this paper is trying to address.

Reduces collective communication overheads in tensor parallelism
Minimizes synchronization inefficiencies and pipeline bubbles
Improves hybrid model parallelism for large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained decoupling of forward and backward passes
Braiding computation units into composite sequence
Synergistic schedule minimizing both TP and PP bubbles
🔎 Similar Papers
No similar papers found.
M
Mengshi Qi
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
J
Jiaxuan Peng
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China
J
Jie Zhang
J
Juan Zhu
Y
Yong Li
Huadong Ma
Huadong Ma
BUPT
Internet of ThingsMultimedia