🤖 AI Summary
Online speculative decoding faces two key challenges: vocabulary incompatibility between the target and draft models, and the difficulty of continuously improving latency during deployment. This paper proposes OmniDraft, presented as the first edge-deployable draft framework supporting cross-vocabulary compatibility and online adaptation. It addresses the vocabulary mismatch via cross-vocabulary mapping and hybrid-distillation fine-tuning, and introduces an online n-gram cache with adaptive draft generation so that the drafter can dynamically accommodate diverse target LMs and incrementally learn from user data. Its core innovation is the "single-draft-model-for-multiple-targets" paradigm: one lightweight Llama-68M draft model seamlessly collaborates with heterogeneous target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, achieving a 1.5x-2x end-to-end inference speedup on edge devices while maintaining low latency, high vocabulary compatibility, and continuous online evolution.
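To make the cross-vocabulary idea concrete, here is a minimal sketch of an online n-gram cache that learns alignments between draft-tokenizer n-grams and target-tokenizer tokens during verification. All names (`NGramCache`, `update`, `lookup`) are hypothetical illustrations, not the paper's actual API, and the real system learns these alignments alongside hybrid distillation.

```python
class NGramCache:
    """Toy online cache mapping a draft-tokenizer n-gram (tuple of draft
    token ids) to a single target-tokenizer token id. Illustrative only."""

    def __init__(self):
        self.table = {}  # draft n-gram -> target token id

    def update(self, draft_ngram, target_token):
        # Record an alignment observed while the target model verified drafts.
        self.table[tuple(draft_ngram)] = target_token

    def lookup(self, draft_ngram):
        # Return the aligned target token, or None on a cache miss
        # (a miss would fall back to normal target-model decoding).
        return self.table.get(tuple(draft_ngram))

cache = NGramCache()
cache.update([17, 42], 9001)   # e.g. two draft sub-tokens align to one target token
print(cache.lookup([17, 42]))  # -> 9001
print(cache.lookup([17, 43]))  # -> None (cache miss)
```

The cache grows online as more user data is processed, which is what lets a single drafter adapt to targets with entirely different vocabularies.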
📝 Abstract
Speculative decoding generally requires a small, efficient draft model that is either pretrained or distilled offline for a particular target model series, for instance the Llama or Qwen models. In online deployment settings, however, two major challenges arise: 1) the target model in use may be incompatible with the draft model; 2) latency is expected to improve with usage over time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch between draft and target models, and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications, where model cost, efficiency, and user customization are the primary concerns. This further highlights the need to tackle the above challenges and motivates the *"one drafter for all"* paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding, and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models, including Vicuna-7B, Qwen2-7B, and Llama3-8B, for speculative decoding, and additionally provides up to a 1.5-2x speedup.