🤖 AI Summary
This work addresses a key limitation of fixed-length drafts in speculative decoding: the optimal draft length varies dynamically across decoding steps, so a fixed length constrains the achievable acceleration. To overcome this, the authors propose Pacer, a method that introduces a lightweight, trainable pre-verification layer to verify draft tokens blockwise and dynamically adjust the draft length, enabling adaptive speculative decoding. By breaking free from the bottleneck of conventional fixed-length strategies, Pacer achieves up to a 2.66× speedup over standard autoregressive decoding across multiple benchmarks, outperforming baseline speculative decoding. When integrated with Ouroboros, the speedup further improves to 3.09×.
📝 Abstract
Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation as soon as blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to a 2.66× speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to a 3.09× speedup.
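The blockwise early-stopping loop described above can be sketched with toy stand-ins for the draft model and the pre-verification layer. This is a minimal illustration of the control flow only, not the paper's implementation; all function names and the scoring rule below are hypothetical:

```python
from itertools import count

def speculate_with_pre_verification(draft_next, pre_verify,
                                    max_tokens=8, block_size=2, threshold=0.5):
    """Toy sketch of blockwise pre-verified drafting (illustrative only):
    the draft model keeps emitting blocks of tokens, and a lightweight
    pre-verifier scores each block; drafting stops early the first time a
    block's score falls below the threshold, so the draft length adapts
    per decoding step instead of being fixed."""
    tokens = []
    while len(tokens) < max_tokens:
        block = [draft_next() for _ in range(block_size)]
        if pre_verify(block) < threshold:
            break  # pre-verification failed: send what we have to the target model
        tokens.extend(block)
    return tokens

# Usage with toy stand-ins: draft tokens are just increasing integers, and
# the "pre-verifier" rejects any block containing a token >= 4.
counter = count()
draft = lambda: next(counter)
score = lambda block: 1.0 if max(block) < 4 else 0.0
print(speculate_with_pre_verification(draft, score))  # -> [0, 1, 2, 3]
```

In a real SD pipeline the accepted prefix would then go to the target model for parallel verification; the point here is only that the draft length is decided online, block by block.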