TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe pipeline bubbles in high-throughput large language model (LLM) inference, caused by load imbalance between the prefill and decode phases and by costly phase-switching overheads in conventional pipeline parallelism, this paper proposes a temporally-disaggregated pipeline parallelism architecture. It separates the prefill and decode execution streams in the temporal dimension to eliminate phase-switching bubbles. Key innovations include a hierarchy-controller structure that decouples scheduling from execution, an AI-based greedy prefill scheduler (leveraging output-length prediction and memory simulation), inter-batch work stealing, and a phase-switching policy that weighs the performance loss from reduced computational intensity against the bubble cost of switching. Evaluated on GPU nodes interconnected only via PCIe, TD-Pipe achieves up to 1.91× higher throughput than the existing tensor-parallel approach and up to 2.73× higher than the existing pipeline-parallel approach, significantly alleviating communication and scheduling bottlenecks in LLM inference.
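The phase-switching policy in the summary above can be sketched as a simple cost comparison. All function and parameter names below are illustrative assumptions, not from the paper, and the cost model is reduced to its bare essentials:

```python
# Hypothetical sketch of the decode-to-prefill switch decision: keep
# decoding while the running batch is still large enough to saturate the
# pipeline, and switch once the projected slowdown from a shrinking batch
# outweighs the one-off bubble cost of a phase switch.
# All names and the cost model are illustrative, not from the paper.

def should_switch_to_prefill(active_seqs: int,
                             min_efficient_batch: int,
                             per_step_decode_loss: float,
                             expected_steps_left: int,
                             switch_bubble_cost: float) -> bool:
    """Return True when switching phases is cheaper than decoding on."""
    if active_seqs >= min_efficient_batch:
        # Batch still compute-intense enough; no reason to switch.
        return False
    # Projected cumulative slowdown if we keep decoding the shrunken batch.
    stay_cost = per_step_decode_loss * expected_steps_left
    # Switch only when the one-time bubble is the cheaper option.
    return switch_bubble_cost < stay_cost
```

In this toy model the decision hinges on a single product; the actual policy would presumably calibrate both cost terms from profiled hardware behavior.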

📝 Abstract
As model sizes continue to increase, pipeline parallelism shows great promise in throughput-oriented LLM inference due to its low communication demands. However, imbalanced pipeline workloads and complex data dependencies in the prefill and decode phases result in massive pipeline bubbles and severe performance degradation. To better exploit pipeline parallelism for high-throughput LLM inference, we propose TD-Pipe, whose key idea is a temporally-disaggregated pipeline parallelism architecture. Specifically, this architecture disaggregates the prefill and decode phases in the temporal dimension so as to eliminate the pipeline bubbles caused by phase switching. TD-Pipe identifies potential issues in exploiting this novel architecture and provides solutions. First, a hierarchy-controller structure better coordinates devices in pipeline parallelism by decoupling scheduling from execution. Second, the AI-based greedy prefill approach aggressively performs more prefills by predicting output lengths and simulating memory usage. Third, the inter-batch work stealing approach dynamically balances decode-phase workloads across batches to reduce bubbles. Fourth, the spatial-temporal intensity comparison approach determines the optimal switch from decode to prefill by comparing the performance drop from reduced computational intensity with that from phase-switching bubbles. Extensive experiments show that TD-Pipe increases LLM inference throughput by up to 1.91x over the existing tensor-parallel approach and 2.73x over the existing pipeline-parallel approach on GPU nodes with only PCIe interconnection.
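The AI-based greedy prefill step from the abstract can be illustrated with a minimal admission loop. Here `predict_len` stands in for the paper's learned output-length predictor, and the "memory simulation" is reduced to reserving KV-cache bytes for prompt plus predicted output tokens; all names and the memory model are hypothetical:

```python
def greedy_prefill_admit(pending, predict_len, kv_bytes_per_token,
                         mem_budget, mem_used=0):
    """Admit as many prefill requests as a simulated KV-cache budget allows.

    Minimal sketch of the greedy prefill idea: `predict_len` stands in for
    a learned output-length predictor, and memory "simulation" is reduced
    to reserving bytes for prompt + predicted output tokens.
    """
    admitted = []
    for req in pending:
        # Reserve KV-cache space for the prompt and the predicted output.
        need = (req["prompt_len"] + predict_len(req)) * kv_bytes_per_token
        if mem_used + need > mem_budget:
            break  # simulated memory exhausted; stop admitting prefills
        mem_used += need
        admitted.append(req)
    return admitted
```

Admitting greedily under a simulated budget lets the scheduler batch more prefills up front without risking an out-of-memory stall mid-decode.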
Problem

Research questions and friction points this paper is trying to address.

Eliminate pipeline bubbles from phase switching in LLM inference
Balance workloads between prefill and decode phases
Optimize device coordination in pipeline parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporally-disaggregated pipeline parallelism architecture
Hierarchy-controller structure for device coordination
AI-based greedy prefill and inter-batch work stealing
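A minimal sketch of the inter-batch work-stealing idea listed above, assuming sequences can be moved freely between two decode micro-batches (the real system must also handle KV-cache placement and per-sequence cost):

```python
def steal_work(batch_a, batch_b):
    """Rebalance decode sequences so two micro-batches finish together.

    Illustrative only: real inter-batch work stealing must also account
    for KV-cache placement and per-sequence decode cost.
    """
    merged = batch_a + batch_b
    half = len(merged) // 2
    # The lighter batch "steals" sequences until the split is even.
    return merged[:half], merged[half:]
```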
Hongbin Zhang
Taosheng Wei
Sun Yat-sen University, Guangzhou, China
Zhenyi Zheng
Sun Yat-sen University, Guangzhou, China
Jiangsu Du
Sun Yat-sen University, Guangzhou, China
Zhiguang Chen
Sun Yat-sen University, Guangzhou, China
Yutong Lu
Sun Yat-sen University, Guangzhou, China