PipeDec: Low-Latency Pipeline-based Inference with Dynamic Speculative Decoding towards Large-scale Models

📅 2025-04-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address high decoding latency and low per-task resource utilization in multi-node pipeline-parallel LLM inference, this paper proposes a pipeline-embedded dynamic speculative decoding framework. It natively integrates a draft model (LLaMA3.2-1B) into the 14-stage pipeline of a large target model (LLaMA3.1-70B), introducing a novel dynamic prediction tree mechanism that supports real-time cross-node updates and pruning, enabling tight coordination between draft token prediction and target model computation. This design significantly improves global resource utilization per task. In end-to-end decoding, it achieves a 4.46×–7.79× speedup over conventional pipeline parallelism and 2.2×–2.69× over state-of-the-art tree-based speculative decoding, substantially reducing overall latency.

๐Ÿ“ Abstract
Autoregressive large language model inference primarily consists of two stages: pre-filling and decoding. Decoding involves sequential computation for each token, which leads to significant latency. Speculative decoding is a technique that leverages a draft model combined with large-model verification to enhance parallelism without sacrificing accuracy. However, existing external prediction methods face challenges in adapting to multi-node serial deployments. While they can maintain speedup under such conditions, the high latency of multi-node deployments ultimately results in low overall efficiency. We propose a speculative decoding framework named PipeDec to address the low global resource utilization of single tasks in pipeline deployments, thereby reducing decoding latency. We integrate a draft model into the pipeline of the large model and immediately forward each prediction from the draft model to subsequent pipeline stages. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. This approach leverages the draft model's predictions to utilize all pipeline nodes for parallel decoding of a single task. Experiments were conducted using LLaMA3.2 1B as the draft model in conjunction with a 14-stage parallel pipeline to accelerate LLaMA3.1 70B on six datasets of different types. During the decoding phase of a single task, PipeDec achieved a 4.46x-7.79x speedup compared to traditional pipeline parallelism and a 2.2x-2.69x speedup compared to baseline tree-based speculative decoding methods. The code will be released after the review process.
Problem

Research questions and friction points this paper is trying to address.

Reduces decoding latency in large language models
Improves global resource utilization in pipeline deployments
Enhances parallelism with dynamic speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipeline-based speculative decoding for low latency
Dynamic prediction tree for efficient sequence management
Integrated draft model for parallel task decoding
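The dynamic prediction tree at the core of these contributions can be illustrated with a minimal sketch: the draft model expands candidate tokens as tree branches, and once the target model verifies a token, every non-matching branch is pruned and the matching child becomes the new root. The class and method names below are hypothetical illustrations, not the paper's released code.

```python
# Minimal sketch of a dynamic prediction tree for speculative decoding.
# Assumption: tokens are integer ids and children are keyed by token id;
# this is an illustration, not PipeDec's actual implementation.

class PredNode:
    def __init__(self, token):
        self.token = token
        self.children = {}  # token id -> PredNode

    def expand(self, draft_top_k):
        """Attach draft-model candidate tokens as child branches."""
        for tok in draft_top_k:
            self.children.setdefault(tok, PredNode(tok))

    def prune_to(self, verified_token):
        """After target-model verification, keep only the matching
        subtree and return it as the new root (None on a full miss)."""
        return self.children.get(verified_token)

# Usage: speculate two steps ahead, then accept a verified token.
root = PredNode(token=None)
root.expand([5, 9, 3])            # draft proposes three candidates
root.children[9].expand([2, 7])   # speculate one level deeper on token 9

root = root.prune_to(9)           # target model confirms token 9
assert root is not None and sorted(root.children) == [2, 7]
```

In PipeDec, updates and pruning of this tree additionally propagate across pipeline stages in real time, which the single-process sketch above does not capture.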
Haofei Yin
Shandong University
Mengbai Xiao
Shandong University
Rouzhou Lu
Shandong University
Xiao Zhang
Shandong University
Dongxiao Yu
Professor of Computer Science, Shandong University
Distributed Computing, Wireless Networking, Graph Algorithms
Guanghui Zhang
Shandong University