Fast Inference via Hierarchical Speculative Decoding

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the linear growth of inference latency with output length in autoregressive large language model (LLM) generation, this paper proposes Hierarchical Speculative Decoding (HSD): a framework that stacks multiple heterogeneous draft models into a layered hierarchy so that tokens are proposed and verified in parallel, significantly improving throughput while preserving the target model's output quality. Unlike conventional single-draft approaches, HSD introduces a hierarchical draft architecture and selects its latency-optimal configuration (layer count, draft model choice, and token proposal width) in polynomial time using a derived expression for expected latency. Each level's proposals are verified by the next larger model in a single forward pass, with the target model verifying last. Experiments across diverse model scales and tasks demonstrate consistent acceleration: HSD achieves up to 1.2× higher decoding speed than the best single-draft baseline, substantially reducing end-to-end generation latency.

📝 Abstract
Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
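The draft-then-verify cascade described in the abstract can be sketched as a toy simulation. All names (`ToyModel`, `hsd_round`) and the probabilistic acceptance model are illustrative assumptions, not the paper's implementation:

```python
import random

class ToyModel:
    """Stand-in for a language model; `accuracy` is the chance one of its
    proposed tokens is accepted by the next model up (illustrative only)."""
    def __init__(self, name, accuracy):
        self.name = name
        self.accuracy = accuracy

    def propose(self, k):
        # Draft k candidate tokens (random token ids as placeholders).
        return [random.randint(0, 9) for _ in range(k)]

    def verify(self, tokens):
        # Keep the longest accepted prefix; in HSD this verification is
        # a single forward pass over all proposed tokens.
        accepted = []
        for t in tokens:
            if random.random() >= self.accuracy:
                break
            accepted.append(t)
        return accepted

def hsd_round(hierarchy, k):
    """One HSD round: the smallest model drafts k tokens, then each
    successively larger model (the target last) verifies them and keeps
    the accepted prefix."""
    tokens = hierarchy[0].propose(k)
    for model in hierarchy[1:]:
        tokens = model.verify(tokens)
        if not tokens:
            break
    return tokens
```

For example, `hsd_round([ToyModel("small", 0.9), ToyModel("mid", 0.8), ToyModel("target", 0.7)], k=5)` returns the prefix of drafted tokens that every level of the hierarchy accepted.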
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in autoregressive text generation
Optimizing hierarchical speculative decoding model selection
Accelerating transformer models via parallel token verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical speculative decoding stacks multiple draft models
Tokens verified progressively by successively larger models
Polynomial-time selection of latency-optimal hierarchy configuration
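The polynomial-time hierarchy selection can be viewed as a shortest-path problem over candidate models ordered by size. Below is a hedged dynamic-programming sketch: the per-token cost formula, the fixed proposal width `k`, and all names are assumptions for illustration, not the paper's derived latency expression:

```python
def optimal_hierarchy(costs, accept, k=4):
    """costs[j]: per-forward-pass latency of model j (target is last);
    accept[i][j]: per-token chance that model j accepts a token drafted
    by model i. dp[j] is a stylized expected latency per accepted token
    when j tops the chain: either j decodes alone, or a sub-hierarchy
    ending at i drafts k tokens that j verifies in one pass.
    O(n^2) edges overall, hence polynomial time."""
    n = len(costs)
    dp = list(costs)          # base case: model j decodes with no draft
    prev = [None] * n
    for j in range(1, n):
        for i in range(j):
            cand = (k * dp[i] + costs[j]) / (k * accept[i][j])
            if cand < dp[j]:
                dp[j], prev[j] = cand, i
    chain, node = [], n - 1   # walk back from the target to recover
    while node is not None:   # the chosen hierarchy
        chain.append(node)
        node = prev[node]
    return dp[n - 1], chain[::-1]
```

With three models of cost 1, 3, and 10 and high acceptance rates between adjacent sizes, this toy DP prefers the full chain `[0, 1, 2]` over single-draft or draft-free decoding.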