🤖 AI Summary
To address the slow autoregressive inference speed of large language models (LLMs) and the inherent trade-off between generation quality and efficiency, this paper proposes a novel speculative beam decoding framework with dynamic beam width, marking the first deep integration of speculative decoding and beam search. Our approach introduces four key innovations: (1) generating multiple draft sequences for the target LLM via beam-trajectory sampling from a small model; (2) an adaptive beam-width control mechanism that dynamically balances accuracy and latency; (3) a multi-tree parallel verification architecture to improve draft throughput; and (4) a memory-efficient draft caching strategy. Experiments demonstrate that our method achieves a 1.8–2.3× speedup over standard beam search while preserving its generation quality, and reduces memory overhead by 37%, significantly outperforming existing speculative decoding techniques.
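To make innovation (2) concrete, here is a minimal sketch of an adaptive beam-width controller. This is a hypothetical heuristic for illustration only, not the paper's actual control rule: it widens the beam when draft tokens are frequently rejected (the small model is unreliable in this context) and narrows it when acceptance is high, clamped to a fixed range. The thresholds `low`/`high` and the step size of 1 are assumptions.

```python
def adapt_beam_width(width, acceptance_rate, min_width=1, max_width=8,
                     low=0.5, high=0.8):
    """Illustrative beam-width controller (hypothetical heuristic).

    width: current number of beams
    acceptance_rate: fraction of draft tokens the large model accepted
                     during the most recent verification step
    Returns the beam width to use for the next draft round.
    """
    if acceptance_rate < low:
        width += 1   # drafts unreliable: keep more candidate sequences
    elif acceptance_rate > high:
        width -= 1   # drafts reliable: shrink the beam to save compute/memory
    return max(min_width, min(max_width, width))
```

A controller of this shape trades a small amount of per-step bookkeeping for the ability to spend beam capacity only where the draft model's trajectories actually diverge from the target model.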
📄 Abstract
Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving speed-ups of 1–2×. Although speculative decoding preserves the output distribution of multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft and verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. In addition, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we illustrate a simple modification to our algorithm to mitigate the memory overhead of beam sampling...
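The draft-and-verify scheme described above builds on the token-level acceptance rule of standard speculative sampling, which guarantees that accepted tokens follow the large model's distribution exactly. A minimal sketch of that baseline rule (not the paper's beam-level extension) is below; `p` and `q` are the target and draft models' next-token distributions, and the deterministic `rng` hook exists only to make the sketch testable.

```python
import random

def speculative_accept(p, q, draft_token, rng=random.random):
    """Standard speculative-sampling acceptance test (baseline, not DSBD).

    p: target-model distribution over the vocabulary (list of probs)
    q: draft-model distribution over the vocabulary
    draft_token: token index sampled from q
    Accept the draft token with probability min(1, p[x]/q[x]); on
    rejection, resample from the residual max(0, p - q), renormalized.
    The resulting token is distributed exactly according to p.
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng() < accept_prob:
        return draft_token, True
    # Rejection: sample from the renormalized residual distribution.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    residual = [r / total for r in residual]
    u, cum = rng(), 0.0
    for tok, r in enumerate(residual):
        cum += r
        if u < cum:
            return tok, False
    return len(p) - 1, False
```

DSBD's challenge (1) is precisely that this rule yields a single sequence, whereas beam sampling must keep several candidates alive; the paper's scheme generalizes the verification step to whole beam trajectories.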