🤖 AI Summary
Autoregressive decoding constitutes the primary latency bottleneck in large language model (LLM) inference. Existing speculative decoding methods rely on offline training or auxiliary modules, incurring high computational costs and exhibiting sensitivity to distributional shift. This paper proposes a training-aware self-speculation framework: the first to integrate online learning into speculative decoding, dynamically optimizing the draft model via verifier feedback and employing a progressive scheduling strategy that transitions from KL-divergence minimization to reinforcement learning. Leveraging a block-wise LLM architecture, it unifies online knowledge distillation, reward-masked cross-entropy loss, and on-policy policy gradients, enabling continual learning and inference acceleration within a single, lossless deployment. Evaluated on Spec-Bench, the method achieves a 2.16× end-to-end speedup, matching state-of-the-art methods such as EAGLE-2, while reducing training-data requirements by several orders of magnitude and significantly outperforming pure KL-based distillation baselines.
📝 Abstract
Autoregressive (AR) decoding is a major latency bottleneck for large language models. Speculative decoding (SD) accelerates AR by letting a drafter propose multi-token blocks that a verifier accepts or rejects. However, many SD systems require heavy offline training or extra components. These choices raise data and compute costs and can yield brittle drafters under distribution drift. We introduce *Draft, Verify, & Improve (DVI)*, a training-aware self-speculative framework that combines inference with continual online learning. We partition an LLM into a drafter and a verifier, and during generation, the verifier's accept/reject decisions are converted into supervision signals and used to update the drafter head. A simple *KL→RL* schedule bootstraps calibration via online distillation and then adds reward-masked cross-entropy with an on-policy policy-gradient term, preserving lossless, single-model deployment. On Spec-Bench, DVI achieves a 2.16× wall-time speedup, on par with SoTA approaches such as EAGLE-2, while using orders of magnitude less training data, and ablations show that DVI outperforms KL-only online distillation. DVI demonstrates that *training-aware* self-speculation can deliver state-of-the-art, lossless speedups with minimal training overhead.
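To make the KL→RL schedule concrete, here is a minimal numerical sketch of how such a combined objective could look. This is not the paper's implementation: the function name, the linear warmup schedule, the ±1 accept/reject reward, and all shapes are our own illustrative assumptions. It shows the three ingredients the abstract names: online distillation (KL to the verifier), reward-masked cross-entropy over accepted draft tokens, and an on-policy policy-gradient (REINFORCE-style) term, blended by a ramping coefficient.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dvi_style_loss(draft_logits, verifier_logits, draft_tokens,
                   accepted, step, warmup=1000):
    """Illustrative KL->RL scheduled objective (hypothetical sketch).

    draft_logits, verifier_logits: (T, V) logits at T draft positions.
    draft_tokens: (T,) tokens the drafter proposed.
    accepted: (T,) bool, the verifier's accept/reject decisions.
    step: current online-update step, used to ramp the schedule.
    """
    p = softmax(verifier_logits)   # verifier distribution (teacher)
    q = softmax(draft_logits)      # drafter distribution (student)

    # Online distillation term: KL(p || q), averaged over positions.
    kl = np.mean(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1))

    # Log-prob the drafter assigned to its own proposed tokens.
    idx = np.arange(len(draft_tokens))
    logq = np.log(q[idx, draft_tokens] + 1e-9)

    # Reward-masked cross-entropy: supervise only accepted tokens.
    mask = accepted.astype(float)
    ce = -np.sum(mask * logq) / max(mask.sum(), 1.0)

    # On-policy policy gradient: reward +1 for accept, -1 for reject.
    reward = 2.0 * mask - 1.0
    pg = -np.mean(reward * logq)

    # Progressive schedule: alpha ramps 0 -> 1, moving from pure KL
    # distillation toward the RL-flavored terms.
    alpha = min(step / warmup, 1.0)
    return (1.0 - alpha) * kl + alpha * (ce + pg)
```

At `step=0` the objective reduces to pure online distillation; once `step >= warmup` only the reward-driven terms remain, matching the bootstrapping-then-RL progression the abstract describes.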