🤖 AI Summary
Existing autoregressive decoding of large language models (LLMs) suffers from low GPU utilization and high latency; meanwhile, conventional speculative decoding (SD) incurs substantial overhead when speculation accuracy degrades, especially under large batch sizes, leading to frequent token rejections. To address these issues, this paper proposes a dynamic speculative decoding enhancement that requires no modification to the target LLM and is compatible with diverse SD variants. Its core innovations are: (i) introducing a lightweight auxiliary model to quantify the alignment between draft and target model output distributions, and (ii) adaptively optimizing the verification length based on information gain to minimize redundant computation. Experiments across multiple LLMs and tasks show that, at batch sizes of 32-80, the method achieves an average speedup of 1.4x over standard autoregressive decoding, with peak improvements up to 2.0x, significantly outperforming both baseline decoding and conventional SD.
📝 Abstract
LLMs suffer from low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this by using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target model. However, when speculation accuracy is low, the overhead from rejected tokens can offset the benefits, limiting SD's effectiveness, especially at large batch sizes. To address this, we propose Speculative Verification (SV), an efficient augmentation to SD that dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. SV introduces a companion model, a small auxiliary model similar in size to the draft model, to estimate the alignment between the draft and target model distributions. By maximizing the information gain from quantifying this alignment, SV refines verification decisions, reducing wasted computation on rejected tokens and improving decoding efficiency. Moreover, SV requires no modifications to the draft or target models and is compatible with existing SD variants. We extensively evaluated SV on publicly available LLMs across three NLP tasks using nine combinations of draft, companion, and target models, including 13B-72B target models and three model variants: base (no fine-tuning), instruction-tuned, and task fine-tuned. Across all experiments and batch sizes (4-80), SV consistently outperforms both SD and standard decoding with the target model. It improves SD performance by up to 2$\times$, with an average speedup of 1.4$\times$ in large-batch settings (batch sizes 32-80). These results demonstrate SV's robustness, scalability, and practical utility for efficient LLM inference.
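To make the mechanism concrete, here is a minimal sketch of the two ingredients the abstract describes: the standard SD acceptance rule (accept a drafted token with probability min(1, p_target/p_draft)), and a toy heuristic that adapts the verification length from an estimated acceptance rate. The heuristic and its cost constant are illustrative assumptions, not the paper's actual information-gain formulation; in SV the acceptance-rate estimate would come from the companion model's alignment signal.

```python
def accept_prob(p_target: float, p_draft: float) -> float:
    """Standard speculative-decoding acceptance rule:
    accept a drafted token with probability min(1, p_target / p_draft)."""
    return min(1.0, p_target / p_draft)

def choose_verify_len(est_accept_rate: float, max_len: int = 8,
                      verify_cost: float = 0.2) -> int:
    """Toy adaptive verification length (illustrative, not the paper's rule).

    If each drafted token is accepted independently with rate r, the
    expected number of accepted tokens among the first k drafts is
    sum_{i=1..k} r^i (a prefix survives only if all earlier tokens were
    accepted). We pick the k that maximizes expected accepted tokens
    minus an assumed per-token verification cost.
    """
    best_k, best_val = 1, float("-inf")
    for k in range(1, max_len + 1):
        expected_accepted = sum(est_accept_rate ** i for i in range(1, k + 1))
        value = expected_accepted - verify_cost * k
        if value > best_val:
            best_val, best_k = value, k
    return best_k

# When the draft aligns well with the target, verify long speculations;
# when alignment is poor, keep verification short to avoid wasted work.
print(choose_verify_len(0.9))  # well-aligned draft -> long verification
print(choose_verify_len(0.3))  # poorly aligned draft -> short verification
```

This captures the intuition behind SV: the better the estimated draft-target alignment, the longer the speculation worth verifying, so compute is not wasted on tokens that are likely to be rejected.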