Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

📅 2025-02-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the instability, fine-tuning dependency, and poor task generalization of speculative decoding in large language model (LLM) inference, this paper proposes Hierarchy Drafting (HD), a lossless speculative decoding framework that requires no fine-tuning and is grounded in temporal locality. HD introduces a hierarchical database architecture that organizes tokens from multiple sources according to their temporal locality, enabling robust acceleration across diverse tasks, sampling temperatures, and model scales. The method combines multi-level, cache-inspired draft generation with a lightweight matching algorithm, requiring no model modification or retraining. Evaluated on Spec-Bench, HD reduces drafting latency by up to 42% for 7B and 13B models, significantly outperforming existing database-driven drafting approaches while guaranteeing output identical to standard decoding.

📝 Abstract
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. However, current drafting strategies usually require significant fine-tuning or have inconsistent performance across tasks. To address these challenges, we propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. In the drafting step, HD sequentially accesses multiple databases to obtain draft tokens from the highest to the lowest locality, ensuring consistent acceleration across diverse tasks and minimizing drafting latency. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures.
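The drafting procedure described above (index token continuations into databases ordered from highest to lowest temporal locality, then probe them in that order at draft time) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class name, n-gram keying, and three-level layout (e.g. generation history, prompt context, global corpus) are assumptions for the example.

```python
# Hypothetical sketch of hierarchy drafting: several n-gram databases are
# kept in order of temporal locality, and the drafter returns tokens from
# the first (most local) database that matches the current context.

class HierarchicalDraftDB:
    """Databases ordered from highest to lowest temporal locality."""

    def __init__(self, num_levels=3, ngram=2, max_draft=5):
        # levels[0] is the most local database (e.g. current generation),
        # levels[-1] the least local (e.g. global corpus statistics).
        self.levels = [dict() for _ in range(num_levels)]
        self.ngram = ngram        # context key length
        self.max_draft = max_draft  # cap on draft length

    def update(self, level, tokens):
        """Index continuations of every n-gram in `tokens` at one level."""
        n = self.ngram
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.levels[level][key] = tokens[i + n:i + n + self.max_draft]

    def draft(self, context):
        """Probe databases from highest to lowest locality; first hit wins."""
        key = tuple(context[-self.ngram:])
        for db in self.levels:
            if key in db:
                return db[key]
        return []  # no match: fall back to plain autoregressive decoding
```

In speculative decoding, the returned draft tokens would then be verified by the target LLM in a single forward pass, so acceleration is lossless regardless of how the draft was produced; the hierarchy only affects how often drafts are accepted, not the output.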
Problem

Research questions and friction points this paper is trying to address.

Accelerate inference in Large Language Models
Improve consistency across diverse tasks
Minimize drafting latency using hierarchical drafting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Drafting based on temporal locality
Lossless acceleration in speculative decoding
Multiple databases for consistent task performance