OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Online speculative decoding faces two key challenges: vocabulary incompatibility between target and draft models, and difficulty in continuously optimizing latency during deployment. This paper proposes OmniDraft—the first edge-deployable draft framework supporting cross-vocabulary compatibility and online adaptation. It addresses vocabulary mismatch via cross-vocabulary mapping and hybrid distillation fine-tuning; introduces online n-gram caching and adaptive draft generation to dynamically accommodate diverse target LMs and incrementally learn from user data. Its core innovation is the “single-draft-model-for-multiple-targets” paradigm: a single lightweight Llama-68M draft model seamlessly collaborates with heterogeneous target models—including Vicuna-7B, Qwen2-7B, and Llama3-8B—achieving 1.5×–2× end-to-end inference speedup on edge devices, while ensuring low latency, high vocabulary compatibility, and continuous online evolution.
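To make the cross-vocabulary mapping idea concrete, here is a minimal sketch (not the paper's actual implementation): the draft and target models use different tokenizers, so a drafted sequence is converted to its surface string and re-tokenized into the target vocabulary before verification. The toy vocabularies and the greedy longest-match strategy below are invented for illustration.

```python
# Toy vocabularies for a draft model and a target model that tokenize
# the same text differently (illustrative only).
DRAFT_VOCAB = {0: "spec", 1: "ulative", 2: " dec", 3: "oding"}
TARGET_VOCAB = {10: "specul", 11: "ative", 12: " decoding"}

def map_draft_to_target(draft_ids, draft_vocab, target_vocab):
    """Re-tokenize the draft's surface string with the target vocabulary
    using greedy longest-match (a stand-in for a real cross-vocabulary
    mapping scheme)."""
    text = "".join(draft_vocab[i] for i in draft_ids)
    inv = {s: i for i, s in target_vocab.items()}  # string -> target id
    out, pos = [], 0
    while pos < len(text):
        # Try the longest remaining span first, shrinking until a match.
        for end in range(len(text), pos, -1):
            if text[pos:end] in inv:
                out.append(inv[text[pos:end]])
                pos = end
                break
        else:
            raise ValueError(f"untokenizable span: {text[pos:]!r}")
    return out

# "spec"+"ulative"+" dec"+"oding" == "speculative decoding",
# which the target vocabulary tokenizes as [10, 11, 12].
mapped = map_draft_to_target([0, 1, 2, 3], DRAFT_VOCAB, TARGET_VOCAB)
```

The target model can then score the mapped tokens as usual; the point of the sketch is only that draft tokens need not align one-to-one with target tokens.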

📝 Abstract
Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5×–2× speedup.
Problem

Research questions and friction points this paper is trying to address.

Cross-vocabulary mismatch between draft and target models
Dynamic adaptation to user data for latency improvement
Single draft model compatibility with diverse target models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online n-gram cache with hybrid distillation
Adaptive drafting techniques for speedup
Single draft model for multiple targets
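An online n-gram cache can be sketched as follows. This is a toy illustration, not OmniDraft's actual data structure: it assumes the cache simply records token continuations accepted by the target model and proposes the most frequent continuation for a given context; the class and method names are hypothetical.

```python
from collections import defaultdict, Counter

class NGramCache:
    """Toy online n-gram cache: stores (n-1)-token contexts observed in
    target-accepted output and proposes the most frequent next token."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(Counter)

    def update(self, tokens):
        # Record every (n-1)-token context -> next-token pair.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx][tokens[i + self.n - 1]] += 1

    def propose(self, context):
        # Look up the trailing (n-1) tokens; None means no cached draft.
        ctx = tuple(context[-(self.n - 1):])
        if ctx not in self.table:
            return None
        return self.table[ctx].most_common(1)[0][0]

cache = NGramCache(n=3)
cache.update([1, 2, 3, 1, 2, 3])  # tokens previously accepted by the target
cache.propose([5, 1, 2])          # context "1 2" was followed by 3 twice
```

Because the cache is keyed on target-vocabulary tokens, it sidesteps the draft/target vocabulary mismatch for any context it has already seen, and it grows incrementally from user data during deployment.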
Ramchalam Kinattinkara Ramakrishnan
Qualcomm AI Research, Toronto
Machine Learning · Computer Science · Deep Learning
Zhaocong Yuan
Qualcomm AI Research
Shaojie Zhuo
Qualcomm
Efficient Training and Inference · Vision · Speech · Language
Chen Feng
Qualcomm AI Research
Yicheng Lin
Qualcomm AI Research
Chenzheng Su
Qualcomm AI Research
Xiaopeng Zhang
Qualcomm AI Research