🤖 AI Summary
This work addresses the accuracy degradation of draft models in long-range speculative decoding: draft accuracy falls as the number of speculation steps increases, and drafters trained with test-time training (TTT) still show this decay. From the perspective of context retention, the authors propose the KV-Reuse Hypothesis and introduce KVShot, a diagnostic framework that compares hidden-only, KV-only, and hybrid reuse strategies. They argue that the target hidden state is a query-biased compression of the context, whereas the target model's KV cache retains every token-wise KV representation and can therefore give the draft model richer long-range signals. They also identify two structural bottlenecks: shallow drafters estimate target queries inaccurately, and draft-side KV projections receive sparse gradient signals. Experiments on Qwen3-8B show that KV reuse improves long-range acceptance, but end-to-end speedup remains marginal under current training pipelines; the authors suggest that block-wise training paradigms are needed to realize the gains.
📝 Abstract
Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as an explicit context, retaining the complete set of token-wise KV representations. We therefore posit the KV-Reuse Hypothesis: allowing the draft model to reuse the target KV cache can provide richer signals for long-horizon drafting. To test this hypothesis, we introduce KVShot, a diagnostic framework that compares three reuse paradigms: hidden-only, KV-only, and hybrid. Extensive evaluations on Qwen3-8B show that KV-Reuse improves long-range acceptance, although end-to-end speedups remain marginal under current training pipelines. Our analysis identifies two key structural bottlenecks: shallow drafters struggle to estimate target queries accurately, and draft-side KV projections receive sparse gradient signals. These findings suggest that realizing the full potential of KV-aware decoding requires moving beyond TTT toward block-wise training paradigms. By exposing these bottlenecks, KVShot provides a foundational diagnostic testbed and a clear roadmap for designing next-generation inference architectures.
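The contrast between hidden-state reuse and KV reuse can be illustrated with a toy single-head attention example. This is only a sketch under stated assumptions, not KVShot's implementation: the keys, values, queries, and the helper names `softmax` and `attend` are all hypothetical and chosen so that the current query matches the first context token while a later speculative step would need the third one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head dot-product attention; returns (weights, output)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Hypothetical KV cache for three context tokens (one row per token).
keys = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

q_now = [1.0, 0.0, 0.0]    # query at the current position (matches token 0)
q_later = [0.0, 0.0, 1.0]  # query a later speculative step might need (token 2)

# Hidden-state reuse: the drafter only sees this query-weighted summary.
w_now, hidden = attend(q_now, keys, values)

# KV reuse: the drafter can attend over the full cache with its own query.
w_later, kv_out = attend(q_later, keys, values)

print("weight on token 2 under the current query:", round(w_now[2], 4))
print("token-2 component left in the hidden summary:", round(hidden[2], 4))
print("token-2 component recovered via the KV cache:", round(kv_out[2], 4))
```

In this toy setup the summary produced by the current query keeps almost nothing of the third token, while a later query over the retained keys and values recovers it. In a real drafter the later query must itself be estimated, which is where the query-estimation bottleneck described in the abstract would enter.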