AI Summary
To address high decoding latency and low per-task resource utilization in multi-node pipeline-parallel LLM inference, this paper proposes a pipeline-embedded dynamic speculative decoding framework. It natively integrates a draft model (LLaMA3.2-1B) into the 14-stage pipeline of a large target model (LLaMA3.1-70B), introducing a novel dynamic prediction tree mechanism that supports real-time cross-node updates and pruning, enabling tight coordination between draft token prediction and target model computation. This design significantly improves per-task global resource utilization. In end-to-end decoding, it achieves a 4.46×–7.79× speedup over conventional pipeline parallelism and 2.2×–2.69× over state-of-the-art tree-based speculative decoding, substantially reducing overall latency.
Abstract
Autoregressive large language model inference consists primarily of two stages: prefilling and decoding. Decoding generates tokens sequentially, one per forward pass, which leads to significant latency. Speculative decoding mitigates this by combining a small draft model with verification by the large target model, increasing parallelism without sacrificing accuracy. However, existing external-prediction methods adapt poorly to multi-node serial deployments: although they can retain some speedup under such conditions, the high latency of multi-node deployments leaves overall efficiency low. We propose PipeDec, a speculative decoding framework that addresses the low global resource utilization of single tasks in pipeline deployments, thereby reducing decoding latency. PipeDec integrates a draft model into the pipeline of the large model and immediately forwards each draft prediction to subsequent pipeline stages. A dynamic prediction tree manages prediction sequences across nodes, enabling efficient updating and pruning. This design uses the draft model's predictions to keep all pipeline nodes busy decoding a single task in parallel. Experiments used LLaMA3.2-1B as the draft model together with a 14-stage parallel pipeline to accelerate LLaMA3.1-70B on six datasets of different types. During the decoding phase of a single task, PipeDec achieved a 4.46×–7.79× speedup over traditional pipeline parallelism and a 2.2×–2.69× speedup over baseline tree-based speculative decoding methods. The code will be released after the review process.
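To make the draft-then-verify mechanism the abstract builds on concrete, here is a minimal sketch of greedy speculative decoding. It is not the paper's pipeline-embedded algorithm: toy deterministic next-token functions stand in for the draft and target models, and `k` (the number of tokens the draft speculates per round) is an assumed parameter for illustration.

```python
def greedy_decode(next_token, prompt, max_new):
    """Reference: plain autoregressive decoding with the target model."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(next_token(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """Draft proposes k tokens per round; target verifies and keeps the matching prefix."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft model speculates k tokens autoregressively (cheap).
        spec, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            spec.append(t)
            ctx.append(t)
        # 2) Target verifies all k positions (one parallel pass in a real system).
        accepted, correction = 0, None
        for i in range(k):
            expected = target_next(seq + spec[:i])
            if spec[i] == expected:
                accepted += 1
            else:
                correction = expected  # target's own token replaces the mismatch
                break
        seq.extend(spec[:accepted])
        if correction is not None:
            seq.append(correction)
    return seq[: len(prompt) + max_new]
```

The key property is that the output is identical to plain greedy decoding with the target model regardless of draft quality; the draft only determines how many tokens each target verification pass can accept, and hence the speedup.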