Speculative Decoding with a Speculative Vocabulary

📅 2026-02-14
🤖 AI Summary
This work addresses a key limitation of existing speculative decoding methods that rely on a fixed reduced vocabulary: speculation becomes less effective whenever the target token falls outside that vocabulary. To overcome this, the authors propose SpecVocab, which selects a vocabulary subset at each decoding step instead of fixing one in advance. Like other speculative decoding methods, it yields outputs identical to those of the target model, and it can achieve a higher acceptance length than EAGLE-3. Across a variety of tasks, this translates into up to an 8.1% increase in average throughput over EAGLE-3.

📝 Abstract
Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. It relies on a small draft model tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and an output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.
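
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the SpecVocab method, whose subset-selection procedure is not described here. It only shows why scoring a subset of output embedding rows is cheaper than scoring the full vocabulary, and why a subset that omits the best-scoring token cannot propose it. The names `draft_logits`, `draft_token`, `VOCAB`, `DIM`, and the random toy weights are all assumptions made for illustration.

```python
import random

def draft_logits(hidden, out_embed, token_ids):
    """Score only the selected rows of the output embedding matrix.

    Work is proportional to len(token_ids) * len(hidden), so a smaller
    subset means a cheaper output projection.
    """
    return {t: sum(h * w for h, w in zip(hidden, out_embed[t])) for t in token_ids}

def draft_token(hidden, out_embed, token_ids):
    """Greedy draft proposal restricted to the given token ids."""
    logits = draft_logits(hidden, out_embed, token_ids)
    return max(logits, key=logits.get)

# Toy, randomly initialised weights (hypothetical sizes, illustration only).
random.seed(0)
VOCAB, DIM = 1000, 16
out_embed = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
hidden = [random.gauss(0, 1) for _ in range(DIM)]

# Reference: the token the draft would pick with the full vocabulary.
full_choice = draft_token(hidden, out_embed, range(VOCAB))

# A subset that contains the best token reproduces the full-vocabulary choice
# while scoring far fewer rows.
subset_with = sorted(set(range(50)) | {full_choice})
same_choice = draft_token(hidden, out_embed, subset_with) == full_choice

# A subset that omits the best token can never propose it (the
# out-of-vocabulary failure mode of a fixed reduced vocabulary).
subset_without = [t for t in range(100) if t != full_choice]
missed = draft_token(hidden, out_embed, subset_without) != full_choice

print(same_choice, missed, len(subset_with), VOCAB)
```

In this toy setting, the quality of the draft proposal depends entirely on whether the subset covers the token that the full vocabulary would have selected, which is the motivation the abstract gives for choosing the subset per decoding step rather than fixing it once.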
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
vocabulary reduction
language model inference
draft model
throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Dynamic Vocabulary Selection
SpecVocab
Language Model Acceleration
Token Acceptance Rate