Test-Time Speculation

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speculative decoding methods struggle to accelerate long-text generation because their acceptance length declines sharply as the output grows. This work proposes Test-Time Speculation (TTS), which, for the first time, adapts the draft model online during inference without extra target-model calls. By reusing the verification signals that the target model already produces, TTS performs online knowledge distillation to update the draft model dynamically, aligning it with the target model's long-sequence output distribution and overcoming the distribution shift inherent in offline training. Evaluated on the Qwen-3, Qwen-3.5, and Llama-3.1 model families, TTS increases acceptance length by 41% on average and by up to 72%; these improvements grow more pronounced as generation length increases.
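To make the acceptance-length metric concrete, here is a minimal sketch that counts how many leading draft tokens survive verification in one speculation round. It assumes greedy verification (a draft token is accepted only if it equals the token the target would have chosen), which is a simplification and not necessarily the acceptance rule used in the paper. The function name `acceptance_length` and the token IDs are hypothetical.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Count the leading draft tokens that the target model accepts.

    Assumption: greedy verification, where a draft token is accepted only
    if it equals the target's own choice at the same position. Everything
    after the first mismatch is discarded.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


# Hypothetical token IDs for one speculation round.
draft_round = [5, 7, 9, 2]
target_round = [5, 7, 1, 2]
accepted_in_round = acceptance_length(draft_round, target_round)
```

In this toy round the first two draft tokens match and the third does not, so the fourth is discarded even though it happens to agree with the target.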

📝 Abstract

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, the number of draft tokens accepted by the target. Our studies show that the acceptance lengths of even state-of-the-art speculators, such as DFlash, EAGLE-3, and PARD, degrade with generation length, reaching values close to 1 (i.e., no speedup) within just a few thousand output tokens, making speculators ineffective for long-response tasks. Acceptance lengths decline because most speculators are trained offline on short sequences but are forced to match the target model on much longer outputs at inference, well beyond their training distribution. To address this issue, we propose $\textit{Test-Time Speculation (TTS)}$, an online distillation approach that continuously adapts the speculator at test time. TTS leverages the key insight that the token verification step already invokes the target model for each draft token, providing the training signal needed to adapt the draft at no additional cost. Treating the draft as the student and the target as the teacher, TTS adjusts the draft over several speculation rounds, with each update improving the draft's accuracy as generation proceeds. Our results across multiple models from the Qwen-3, Qwen-3.5, and Llama-3.1 families show that TTS improves acceptance lengths over state-of-the-art speculators by up to $72\%$, and by $41\%$ on average, with the benefits growing as generation length increases.
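The student-teacher update described above can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not state the loss, optimizer, or which draft parameters are updated, so this example assumes a KL-divergence loss and plain gradient descent on a single vector of draft logits. The names `distill_step`, `target_probs`, and `draft_logits` are hypothetical, and the fixed `target_probs` vector stands in for the distribution a real target model would return during verification.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student):
    # KL(teacher || student); zero-probability teacher entries contribute 0.
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


def distill_step(draft_logits, target_probs, lr=0.5):
    """One gradient-descent step on KL(target || softmax(draft_logits)).

    The gradient with respect to the logits is softmax(logits) - target_probs.
    """
    draft_probs = softmax(draft_logits)
    return [z - lr * (q - p)
            for z, q, p in zip(draft_logits, draft_probs, target_probs)]


# Assumed stand-in for the target distribution obtained during verification.
target_probs = [0.7, 0.2, 0.1]
draft_logits = [0.0, 0.0, 0.0]  # the draft starts out uniform (mismatched)

history = []
for _ in range(20):  # each iteration plays the role of one speculation round
    history.append(kl_divergence(target_probs, softmax(draft_logits)))
    draft_logits = distill_step(draft_logits, target_probs)
```

Here `history` records the draft-to-target divergence before each update; in this toy setting it shrinks round after round, mirroring the idea that each update moves the draft closer to the target as generation proceeds.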

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

acceptance length

long-generation tasks

distribution shift

test-time adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Speculation

speculative decoding

online distillation