🤖 AI Summary
To address the instability, fine-tuning dependency, and poor task generalization of speculative decoding in large language model (LLM) inference, this paper proposes Hierarchy Drafting (HD), a lossless speculative decoding framework that requires no fine-tuning and is grounded in temporal locality. HD introduces a hierarchical database architecture that organizes tokens from multiple sources by temporal locality, enabling robust acceleration across diverse tasks, sampling temperatures, and model scales. The method pairs multi-level, cache-inspired draft generation with a lightweight matching algorithm, requiring no model modification or retraining. Evaluated on Spec-Bench, HD reduces drafting latency by up to 42% for 7B and 13B models, significantly outperforming existing database-driven draft generation approaches while preserving exact output equivalence.
📝 Abstract
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
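The drafting step described above—consulting databases in order from highest to lowest temporal locality until a match is found—can be sketched as follows. This is a minimal illustration, not the authors' implementation: the database names, n-gram keying, and fallback behavior are all assumptions for exposition, and the target-model verification pass that makes the scheme lossless is omitted.

```python
# Hypothetical sketch of hierarchy-based draft lookup (not the paper's code).
# Each "database" maps a recent-context n-gram to candidate continuation tokens.
# Drafting probes the databases from highest to lowest temporal locality and
# falls back to the next level on a miss.

class HierarchyDraftSketch:
    def __init__(self, databases):
        # databases: list of dicts ordered from highest to lowest locality,
        # e.g. [current-generation history, session history, global corpus]
        self.databases = databases

    def draft(self, context, ngram=2, max_draft=4):
        """Return up to max_draft draft tokens for the last `ngram` tokens."""
        key = tuple(context[-ngram:])
        for db in self.databases:
            if key in db:
                return db[key][:max_draft]  # first hit wins (highest locality)
        return []  # no draft available; decode normally

# Usage: tokens are ints; verification by the target LLM is not shown.
history_db = {(5, 7): [9, 11]}                 # highest locality
corpus_db = {(5, 7): [1, 2, 3], (0, 1): [4]}   # lowest locality
drafter = HierarchyDraftSketch([history_db, corpus_db])
print(drafter.draft([3, 5, 7]))  # hits history_db first → [9, 11]
```

Because draft tokens are only proposals that the target model verifies in a single forward pass, a wrong or missing draft costs little, which is why probing cheap, high-locality sources first keeps drafting latency low without affecting output quality.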