SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
In long-context generation, the verification phase of speculative decoding becomes a computational bottleneck. To address this, we propose SpecPV—a lightweight self-speculative decoding method. Its core innovation lies in leveraging partial key-value (KV) cache states for rapid verification, complemented by periodic full-state verification to dynamically bound error accumulation. SpecPV requires no additional training and is compatible with mainstream LLM architectures (e.g., LLaMA-3.1-8B-Instruct, Qwen3), preserving generation accuracy while substantially reducing verification overhead. Experiments demonstrate that SpecPV achieves up to 6× decoding speedup on long-document understanding tasks, with negligible degradation in output quality. By decoupling verification cost from context length and mitigating error propagation without architectural modification, SpecPV effectively alleviates the efficiency bottleneck of speculative decoding in long-context scenarios.

📝 Abstract
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value (KV) states and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and the Qwen3 series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation in output quality.
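The draft-verify loop described above, with fast partial verification plus periodic full verification, can be sketched on a toy model. All functions below are illustrative stand-ins (the "models" are simple arithmetic rules, and "partial KV" is mimicked by truncating the context), not the paper's implementation:

```python
# Toy sketch of a SpecPV-style loop (assumption: not the authors' code).
# target_next = full-context "target model"; partial_next approximates it
# using only the last `window` tokens, mimicking verification on a partial
# KV cache; a periodic full pass re-checks recent tokens with the complete
# context and rolls back on the first mismatch, bounding error accumulation.

def target_next(ctx):
    # Stand-in target model: deterministic next token from the full context.
    return (sum(ctx) + len(ctx)) % 7

def partial_next(ctx, window=4):
    # Fast verifier: same rule, but sees only a suffix (partial KV state).
    return target_next(ctx[-window:])

def draft_next(ctx):
    # Cheap draft model proposing the next token.
    return (ctx[-1] + 1) % 7 if ctx else 0

def greedy_decode(prompt, n_tokens):
    # Baseline: standard autoregressive decoding with the target model.
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

def specpv_decode(prompt, n_tokens, k=4, full_every=3, window=4):
    out = list(prompt)
    last_full = len(out)          # everything before this index is fully verified
    rounds = 0
    while True:
        # 1) Draft k candidate tokens autoregressively.
        ctx = list(out)
        draft = []
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Fast partial-KV verification: accept the longest matching prefix,
        #    substituting the verifier's token at the first mismatch.
        ctx = list(out)
        for t in draft:
            v = partial_next(ctx, window)
            out.append(v if v != t else t)
            ctx.append(out[-1])
            if v != t:
                break
        rounds += 1
        # 3) Periodic full verification: re-check with the full context and
        #    roll back (truncate + correct) at the first wrong token.
        done = len(out) - len(prompt) >= n_tokens
        if rounds % full_every == 0 or done:
            i = last_full
            while i < len(out):
                correct = target_next(out[:i])
                if out[i] != correct:
                    out = out[:i] + [correct]   # rollback and fix
                i += 1
            last_full = len(out)
            if len(out) - len(prompt) >= n_tokens:
                return out[:len(prompt) + n_tokens]
```

Because every returned token has passed full verification, the output matches plain autoregressive decoding with the target model exactly; the speedup comes from the partial verifier handling most rounds, with the full pass run only periodically.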
Problem

Research questions and friction points this paper is trying to address.

Standard autoregressive decoding is too slow for long-context generation
Verification dominates speculative decoding cost as context length grows
Fast partial verification accumulates errors unless periodically corrected
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-speculative decoding with partial KV verification
Periodic full verification to correct accumulated errors
Achieves up to 6x speedup in long-context generation
Zhendong Tan
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Xingjun Zhang
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Chaoyi Hu
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
Junjie Peng
Shanghai University
Kun Xia
School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China