🤖 AI Summary
To address the slow inference of large language models (LLMs) and a key limitation of existing speculative decoding methods, namely their reliance on a shared vocabulary between drafter and target, this paper proposes the first lossless speculative decoding framework that supports heterogeneous tokenizers. Methodologically, it requires no vocabulary alignment and modifies or retrains neither the draft nor the target model; cross-vocabulary distribution preservation is achieved via three lightweight components: token mapping, dynamic probability projection, and distribution calibration. The core contribution is the removal of the vocabulary-consistency constraint, enabling truly plug-and-play speculative decoding across disparate tokenizers. Evaluated on summarization, code generation, and long-context tasks, the framework achieves average speedups of 1.8–2.3× over standard autoregressive decoding with zero accuracy degradation, outperforming all baseline speculative decoding approaches.
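The token-mapping idea above can be illustrated with a toy sketch: draft tokens are decoded back to their surface string, then re-encoded with the target tokenizer so the target model can score the drafted text under its own vocabulary. Everything here (the two vocabularies, the greedy longest-match encoder, the function names) is a hypothetical stand-in for illustration, not the paper's actual implementation.

```python
# Toy sketch of string-level token mapping between two tokenizers with
# different vocabularies. NOTE: all vocabularies and helpers below are
# made up for illustration; real tokenizers (BPE, SentencePiece, ...)
# are more complex, but the core idea is the same: decode draft ids to
# text, then re-encode that text with the target tokenizer.

def decode(tokens, vocab):
    """Concatenate the surface strings of a token-id sequence."""
    return "".join(vocab[t] for t in tokens)

def encode_greedy(text, inv_vocab, max_len):
    """Greedy longest-match tokenization (a toy stand-in for a real encoder)."""
    ids, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in inv_vocab:
                ids.append(inv_vocab[piece])
                i += length
                break
        else:
            raise ValueError(f"untokenizable text at position {i}")
    return ids

# Two toy vocabularies that segment the same string differently.
draft_vocab = {0: "spec", 1: "ulative", 2: " dec", 3: "oding"}
target_vocab = {10: "specul", 11: "ative", 12: " ", 13: "decoding"}
target_inv = {s: i for i, s in target_vocab.items()}

draft_ids = [0, 1, 2, 3]                  # the drafter's tokenization
text = decode(draft_ids, draft_vocab)     # recover the surface string
target_ids = encode_greedy(text, target_inv, max_len=8)
print(text, target_ids)                   # same text, target-side ids
```

The design point this illustrates: because the mapping works at the string level, any off-the-shelf model can draft, regardless of how its tokenizer segments the text.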
📝 Abstract
Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, limiting the pool of possible drafters and often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms achieve significant speedups over standard autoregressive decoding. By enabling any off-the-shelf model to serve as drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
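The "lossless" property refers to the standard speculative sampling acceptance rule that SD methods build on: a drafted token x ~ q is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized residual max(p − q, 0), so the result is distributed exactly as the target p. A minimal sketch of that rule, with made-up toy distributions (not this paper's heterogeneous-vocabulary variants):

```python
# Minimal sketch of the classic lossless acceptance rule behind
# speculative decoding. The distributions p and q below are arbitrary
# toy examples; the point is that accept-or-resample yields samples
# distributed exactly as the target p, whatever the drafter q is.
import random

def accept_or_resample(p, q, drafted, rng):
    """Return a token distributed exactly as p, given drafted ~ q."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted                              # accept the draft token
    # Rejected: sample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(range(len(p)), weights=[r / z for r in residual])[0]

# Sanity check: empirical frequencies follow p (target), not q (drafter).
p = [0.7, 0.2, 0.1]        # target distribution
q = [0.2, 0.5, 0.3]        # drafter distribution
rng = random.Random(0)
counts = [0, 0, 0]
n = 100_000
for _ in range(n):
    drafted = rng.choices(range(3), weights=q)[0]
    counts[accept_or_resample(p, q, drafted, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)               # approximately [0.7, 0.2, 0.1]
```

This is why drafter quality affects only speed, never output quality: a weak drafter is rejected more often, but the accepted-or-resampled token always follows the target distribution.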