TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speculative decoding (SD) accelerates large language model (LLM) inference but is constrained by the requirement that the draft and target models share an identical vocabulary—limiting draft model selection and often necessitating costly retraining. This work proposes TokenTiming, the first SD framework to integrate dynamic time warping (DTW) for cross-vocabulary speculative decoding. TokenTiming dynamically aligns token sequences and probability distributions via sequence recoding and DTW-based soft alignment, enabling seamless cooperation between arbitrary off-the-shelf models without architectural modification or retraining. Crucially, it eliminates vocabulary compatibility constraints while preserving decoding correctness and efficiency. Experiments across diverse NLP tasks demonstrate an average 1.57× inference speedup over standard autoregressive decoding, with consistent latency reduction and throughput improvement. TokenTiming significantly enhances the practicality, flexibility, and generalizability of speculative decoding, establishing a foundation for vocabulary-agnostic acceleration of LLM inference.

📝 Abstract
Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which narrows the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It operates by re-encoding the draft token sequence to obtain a new target token sequence, and then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. As a result, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating an average 1.57× speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
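The core alignment step described in the abstract can be illustrated with a minimal sketch: classic DTW over two tokenizations of the same text, where the warping path pairs draft-token indices with target-token indices. The toy token lists and the character-length cost function below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: aligning two tokenizations of the same text with DTW.
# The token lists and cost function are simplified assumptions for
# illustration; they are not the authors' implementation.

def dtw_align(draft_tokens, target_tokens, cost):
    """Classic DTW over two token sequences; returns the warping path
    as a list of (draft_index, target_index) pairs."""
    n, m = len(draft_tokens), len(target_tokens)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(draft_tokens[i - 1], target_tokens[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min((D[i - 1][j - 1], (i - 1, j - 1)),
                   (D[i - 1][j], (i - 1, j)),
                   (D[i][j - 1], (i, j - 1)))[1]
    return list(reversed(path))

# Toy example: the word "speculative" split differently by two tokenizers.
draft = ["spec", "ulative"]     # draft model's tokenization (assumed)
target = ["specul", "ative"]    # target model's tokenization (assumed)

def char_cost(a, b):
    # Simple local cost: difference in token character lengths (assumption).
    return abs(len(a) - len(b))

print(dtw_align(draft, target, char_cost))  # [(0, 0), (1, 1)]
```

The warping path gives a many-to-many correspondence between the two vocabularies' token boundaries, which is what makes a "soft" alignment possible even when no draft token maps exactly onto a target token.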
Problem

Research questions and friction points this paper is trying to address.

Addresses vocabulary mismatch in speculative decoding for LLMs
Enables universal draft model selection without retraining requirements
Accelerates LLM inference using dynamic token alignment method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Time Warping aligns draft and target tokens
Re-encoding draft tokens enables vocabulary mismatch handling
Works with off-the-shelf models without retraining requirements
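Once a DTW path pairs draft and target token indices, the draft model's probabilities can be carried over to the target tokenization for speculative sampling. The max-over-aligned-draft-tokens rule below is an illustrative assumption, not necessarily the paper's exact transfer rule.

```python
# Hedged sketch: transferring per-token draft probabilities onto target
# tokens via a DTW alignment path. The aggregation rule (max) is an
# illustrative assumption, not the paper's specified transfer rule.

def transfer_probs(path, draft_probs, num_target_tokens):
    """path: list of (draft_idx, target_idx) pairs from DTW;
    draft_probs: one probability per draft token.
    Returns one probability per target token."""
    target_probs = [0.0] * num_target_tokens
    for di, ti in path:
        # Several draft tokens may align to one target token; keep the max.
        target_probs[ti] = max(target_probs[ti], draft_probs[di])
    return target_probs

path = [(0, 0), (1, 0), (2, 1)]   # assumed alignment from a DTW run
draft_probs = [0.9, 0.6, 0.8]
print(transfer_probs(path, draft_probs, 2))  # [0.9, 0.8]
```

The transferred distribution then plugs into the standard speculative-sampling accept/reject test on the target side, which is how vocabulary mismatch is hidden from the target model.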
Sibo Xiao
The State Key Laboratory of Blockchain and Data Security, Zhejiang University
Jinyuan Fu
College of Computer Science and Technology
Zhongle Xie
Zhejiang University
AI4DB, ML Systems, DB Systems, OLAP
Lidan Shou
Professor of Computer Science, Zhejiang University
Database, Data & Knowledge Management, ML Systems