Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

📅 2026-05-14

🤖 AI Summary
Conventional draft models are trained with token-level supervision, which does not directly optimize window-level efficiency in speculative decoding: a single unpredictable token can truncate the accepted prefix early and limit the achievable speedup. To address this, the authors propose PPOW, a framework that reformulates draft model optimization as a window-level reinforcement learning problem rather than token-level imitation. PPOW combines a cost-aware speedup reward, a distribution-based proximity reward, and an adaptive divergence-aware windowing mechanism that prioritizes informative windows with high confidence-weighted divergence between the draft and target models. PPOW achieves average acceptance lengths of 6.29–6.52 and speedups of 3.39–4.36× across multiple model families and benchmarks.
📝 Abstract
Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29–6.52 and speedups of 3.39–4.36× across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
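
The claim that speculative utility is window-level and prefix-sensitive can be illustrated with a small sketch. It assumes greedy verification, where a draft token is accepted only if it equals the target model's own choice at that position; this is a simplification for illustration, not the paper's decoding protocol. The function names and token IDs below are hypothetical.

```python
from typing import Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target's choices.

    Assumes greedy verification (exact match per position); the first
    mismatch ends acceptance and the rest of the window is discarded.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def token_accuracy(draft: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where draft and target agree, ignoring order effects."""
    matches = sum(d == t for d, t in zip(draft, target))
    return matches / len(target)


# Hypothetical token IDs for one speculative window of length 5.
target_tokens = [5, 9, 2, 7, 1]
early_miss = [5, 0, 2, 7, 1]  # wrong at position 1
late_miss = [5, 9, 2, 7, 0]   # wrong at position 4

# Both drafts have the same token-level accuracy...
print(token_accuracy(early_miss, target_tokens), token_accuracy(late_miss, target_tokens))
# ...but very different accepted prefix lengths.
print(accepted_prefix_length(early_miss, target_tokens), accepted_prefix_length(late_miss, target_tokens))
```

Both drafts agree with the target on four of five tokens, yet the early mismatch yields an accepted prefix of 1 while the late mismatch yields 4. A token-level objective scores the two drafts identically, which is the gap a window-level objective is meant to close.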
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft model
window-level optimization
acceptance length
inference acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Reinforcement Learning
Window-Level Optimization
Adaptive Windowing
Draft Model