SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context generation, the verification phase of speculative decoding becomes a computational bottleneck. To address this, we propose SpecPV—a lightweight self-speculative decoding method. Its core innovation lies in leveraging partial key-value (KV) cache states for rapid verification, complemented by periodic full-state verification to dynamically bound error accumulation. SpecPV requires no additional training and is compatible with mainstream LLM architectures (e.g., LLaMA-3.1-8B-Instruct, Qwen3), preserving generation accuracy while substantially reducing verification overhead. Experiments demonstrate that SpecPV achieves up to 6× decoding speedup on long-document understanding tasks, with negligible degradation in output quality. By decoupling verification cost from context length and mitigating error propagation without architectural modification, SpecPV effectively alleviates the efficiency bottleneck of speculative decoding in long-context scenarios.

📝 Abstract
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value (KV) states and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and the Qwen3 series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation in output quality.
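The draft-verify loop described above, with fast partial verification plus periodic full verification, can be sketched on a toy model. All functions below are illustrative stand-ins (the "models" are simple arithmetic rules, and "partial KV" is mimicked by truncating the context), not the paper's implementation:

```python
# Toy sketch of a SpecPV-style loop (assumption: not the authors' code).
# target_next = full-context "target model"; partial_next approximates it
# using only the last `window` tokens, mimicking verification on a partial
# KV cache; a periodic full pass re-checks recent tokens with the complete
# context and rolls back on the first mismatch, bounding error accumulation.

def target_next(ctx):
    # Stand-in target model: deterministic next token from the full context.
    return (sum(ctx) + len(ctx)) % 7

def partial_next(ctx, window=4):
    # Fast verifier: same rule, but sees only a suffix (partial KV state).
    return target_next(ctx[-window:])

def draft_next(ctx):
    # Cheap draft model proposing the next token.
    return (ctx[-1] + 1) % 7 if ctx else 0

def greedy_decode(prompt, n_tokens):
    # Baseline: standard autoregressive decoding with the target model.
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

def specpv_decode(prompt, n_tokens, k=4, full_every=3, window=4):
    out = list(prompt)
    last_full = len(out)          # everything before this index is fully verified
    rounds = 0
    while True:
        # 1) Draft k candidate tokens autoregressively.
        ctx = list(out)
        draft = []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Fast partial-KV verification: accept the longest matching prefix,
        #    substituting the verifier's token at the first mismatch.
        ctx = list(out)
        for t in draft:
            v = partial_next(ctx, window)
            out.append(v if v != t else t)
            ctx.append(out[-1])
            if v != t:
                break
        rounds += 1
        # 3) Periodic full verification: re-check with the full context and
        #    roll back (truncate + correct) at the first wrong token.
        done = len(out) - len(prompt) >= n_tokens
        if rounds % full_every == 0 or done:
            i = last_full
            while i < len(out):
                correct = target_next(out[:i])
                if out[i] != correct:
                    out = out[:i] + [correct]   # rollback and fix
                i += 1
            last_full = len(out)
            if len(out) - len(prompt) >= n_tokens:
                return out[:len(prompt) + n_tokens]
```

Because every returned token has passed full verification, the output matches plain autoregressive decoding with the target model exactly; the speedup comes from the partial verifier handling most rounds, with the full pass run only periodically.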
Problem

Research questions and friction points this paper is trying to address.

Standard autoregressive decoding is too slow for long-context generation
Verification dominates speculative decoding cost as context length grows
Fast partial verification accumulates errors unless periodically corrected
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-speculative decoding with partial KV verification
Periodic full verification to correct accumulated errors
Achieves up to 6x speedup in long-context generation
Zhendong Tan
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Xingjun Zhang
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Chaoyi Hu
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Junjie Peng
Shanghai University
Kun Xia
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China