DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

📅 2025-10-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency and poor domain adaptability in speculative decoding for large-vocabulary language models—caused by the draft model’s output head parameter complexity (O(|V|d)) and static, short candidate token lists—this paper proposes a context-aware dynamic shortlist mechanism. The core innovation is a lightweight meta-classifier that performs fine-grained, context-dependent token clustering to generate highly relevant, compact shortlists (<1% of vocabulary size) on-the-fly. Unlike static truncation, this approach preserves coverage while maintaining full verification integrity and substantially reducing drafting-phase computation. Experiments demonstrate a 2.1× increase in average acceptance length, shortlist sizes reduced to 20% of those in conventional methods without sacrificing acceptance rate, and consistent inference speedups of 1.8–2.3× across diverse tasks.
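The shortlisting idea in the summary can be sketched in a few lines. This is an illustrative reconstruction, not the paper's released code: the cluster assignment, the meta-classifier weights, and all names (`token_to_cluster`, `W_meta`, `build_shortlist`) are assumptions. A lightweight linear classifier scores a small number of token clusters from the context's hidden state; the union of the top-k clusters becomes the drafter's shortlist, so the draft-time output head operates over O(|S|d) instead of O(|V|d).

```python
# Hypothetical sketch of context-aware dynamic shortlisting (illustrative names,
# not from the paper). Assumes a precomputed token-to-cluster assignment.
import numpy as np

rng = np.random.default_rng(0)

V, d, C, k = 1000, 64, 16, 3                     # vocab size, hidden dim, #clusters, top-k
token_to_cluster = rng.integers(0, C, size=V)    # fixed token clustering (assumed given)
W_meta = rng.standard_normal((d, C)) * 0.01      # lightweight meta-classifier weights

def build_shortlist(context_hidden):
    """Route the context to its top-k clusters; shortlist = union of their tokens."""
    cluster_scores = context_hidden @ W_meta       # (C,) one score per cluster
    top_clusters = np.argsort(cluster_scores)[-k:] # indices of the k best clusters
    mask = np.isin(token_to_cluster, top_clusters)
    return np.flatnonzero(mask)                    # token ids in the shortlist

h = rng.standard_normal(d)                         # stand-in for drafter hidden state
shortlist = build_shortlist(h)

# Drafter logits are computed only over the shortlist rows of the output head,
# reducing the matrix-vector cost from O(|V|d) to O(|S|d).
W_out = rng.standard_normal((V, d)) * 0.01
draft_logits = W_out[shortlist] @ h
print(len(shortlist), draft_logits.shape)
```

Verification is unaffected: the target model still scores the full vocabulary, so acceptance remains exact.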

📝 Abstract
Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recent scaling of LLM vocabularies has increased token counts substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden-state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
Problem

Research questions and friction points this paper is trying to address.

Reduces drafter latency in large-vocabulary speculative decoding
Replaces static token shortlists with dynamic context-aware selection
Maintains verification accuracy while accelerating drafting process
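A back-of-the-envelope calculation shows why the drafter's output head dominates latency and how much a shortlist saves. The vocabulary and hidden-dimension figures below are illustrative examples, not numbers from the paper; the summary's "<1% of vocabulary size" shortlist figure motivates the 1% setting.

```python
# Illustrative cost of the drafter's output head (one matrix-vector product per
# draft token). Example sizes; only the <1% shortlist fraction comes from the summary.
V, d = 128_000, 4096             # example large vocabulary and hidden dimension
full_head_cost = V * d           # O(|V|d) multiply-adds over the full vocabulary
shortlist_size = V // 100        # shortlist at 1% of the vocabulary
shortlist_cost = shortlist_size * d
print(full_head_cost // shortlist_cost)   # → 100x fewer multiply-adds per draft step
```

Because the cost is linear in the number of scored tokens, shrinking the shortlist shrinks draft-time head compute proportionally, while verification still runs over the full vocabulary.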
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic shortlisting mechanism selects tokens based on context
Lightweight meta-classifiers route contexts to token clusters
Parallel execution of draft encoding and meta shortlisting
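The abstract notes that the meta-classifier runs on a separate stream and finishes before the drafter's hidden state is ready. A minimal sketch of that overlap, using Python threads as a stand-in for GPU streams (all function names and timings are illustrative assumptions, not the paper's implementation):

```python
# Illustrative overlap of draft encoding and meta shortlisting; threads stand in
# for the separate GPU streams described in the abstract.
from concurrent.futures import ThreadPoolExecutor
import time

def draft_encode(ctx):
    time.sleep(0.02)             # stand-in for the drafter's hidden-state computation
    return f"hidden({ctx})"

def meta_shortlist(ctx):
    time.sleep(0.005)            # much cheaper: finishes before draft encoding does
    return ["cluster_3", "cluster_7"]  # hypothetical selected clusters

with ThreadPoolExecutor(max_workers=2) as pool:
    h_future = pool.submit(draft_encode, "ctx")
    s_future = pool.submit(meta_shortlist, "ctx")
    shortlist = s_future.result()    # ready by the time the hidden state arrives
    hidden = h_future.result()
print(hidden, shortlist)
```

Because the shortlist is available when the hidden state lands, the shortlisting step adds no serial latency to the drafting loop.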
Jinbin Zhang
Department of Computer Science, Aalto University, Espoo, Finland
Nasib Ullah
Department of Computer Science, Aalto University, Espoo, Finland
Erik Schultheis
PhD Candidate, Aalto University
Rohit Babbar
University of Bath, UK
Extreme Classification · Sparse Neural Networks · Resource Efficient Machine Learning