PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models

📅 2025-04-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address high decoding latency and low per-task resource utilization in multi-node pipeline-parallel LLM inference, this paper proposes a pipeline-embedded dynamic speculative decoding framework. It natively integrates a draft model (LLaMA3.2-1B) into the 14-stage pipeline of a large target model (LLaMA3.1-70B), introducing a novel dynamic prediction tree mechanism that supports real-time cross-node updates and pruning, enabling tight coordination between draft token prediction and target model computation. This design significantly improves global resource utilization per task. In end-to-end decoding, it achieves a 4.46×–7.79× speedup over conventional pipeline parallelism and 2.2×–2.69× over state-of-the-art tree-based speculative decoding, substantially reducing overall latency.

๐Ÿ“ Abstract
Autoregressive large language model inference primarily consists of two stages: pre-filling and decoding. Decoding involves sequential computation for each token, which leads to significant latency. Speculative decoding is a technique that leverages a draft model combined with large-model verification to enhance parallelism without sacrificing accuracy. However, existing external prediction methods face challenges in adapting to multi-node serial deployments. While they can maintain speedup under such conditions, the high latency of multi-node deployments ultimately results in low overall efficiency. We propose a speculative decoding framework named PipeDec to address the low global resource utilization of single tasks in pipeline deployments, thereby reducing decoding latency. We integrate a draft model into the pipeline of the large model and immediately forward each prediction from the draft model to subsequent pipeline stages. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. This approach leverages the draft model's predictions to utilize all pipeline nodes for parallel decoding of a single task. Experiments were conducted using LLaMA3.2 1B as the draft model in conjunction with a 14-stage parallel pipeline to accelerate LLaMA3.1 70B on six datasets of different types. During the decoding phase of a single task, PipeDec achieved a 4.46x-7.79x speedup compared to traditional pipeline parallelism and a 2.2x-2.69x speedup compared to baseline tree-based speculative decoding methods. The code will be released after the review process.
Problem

Research questions and friction points this paper is trying to address.

Reduces decoding latency in large language models
Improves global resource utilization in pipeline deployments
Enhances parallelism with dynamic speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline-based speculative decoding for low latency
Dynamic prediction tree for efficient sequence management
Integrated draft model for parallel task decoding
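The dynamic prediction tree at the core of these contributions can be illustrated with a minimal sketch: the draft model expands candidate tokens as tree branches, and once the target model verifies a token, every non-matching branch is pruned and the matching child becomes the new root. The class and method names below are hypothetical illustrations, not the paper's released code.

```python
# Minimal sketch of a dynamic prediction tree for speculative decoding.
# Assumption: tokens are integer ids and children are keyed by token id;
# this is an illustration, not PipeDec's actual implementation.

class PredNode:
    def __init__(self, token):
        self.token = token
        self.children = {}  # token id -> PredNode

    def expand(self, draft_top_k):
        """Attach draft-model candidate tokens as child branches."""
        for tok in draft_top_k:
            self.children.setdefault(tok, PredNode(tok))

    def prune_to(self, verified_token):
        """After target-model verification, keep only the matching
        subtree and return it as the new root (None on a full miss)."""
        return self.children.get(verified_token)

# Usage: speculate two steps ahead, then accept a verified token.
root = PredNode(token=None)
root.expand([5, 9, 3])            # draft proposes three candidates
root.children[9].expand([2, 7])   # speculate one level deeper on token 9

root = root.prune_to(9)           # target model confirms token 9
assert root is not None and sorted(root.children) == [2, 7]
```

In PipeDec, updates and pruning of this tree additionally propagate across pipeline stages in real time, which the single-process sketch above does not capture.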
Haofei Yin
Shandong University
Mengbai Xiao
Shandong University
Rouzhou Lu
Shandong University
Xiao Zhang
Shandong University
Dongxiao Yu
Professor of Computer Science, Shandong University
Distributed Computing, Wireless Networking, Graph Algorithms
Guanghui Zhang
Shandong University