Reject Only Critical Tokens: Pivot-Aware Speculative Decoding

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional speculative decoding (SD) enforces strict token-level distributional alignment between draft and target models, resulting in low acceptance rates and limited speedup. This work argues that task utility—e.g., code correctness or factual accuracy—is more practically meaningful than distributional fidelity, and proposes “utility alignment” as a new paradigm: only rejecting *pivot tokens*—those whose rejection demonstrably improves downstream performance. To this end, we design a lightweight classifier that dynamically identifies pivot tokens based on task-specific metrics and selectively filters non-pivot tokens during decoding. This is the first SD framework to shift the optimization objective from distribution matching to utility alignment, incorporating a pivot-aware mechanism that preserves target model accuracy while substantially increasing acceptance rates. Experiments across diverse tasks demonstrate up to 2.5× inference speedup with no degradation in task utility.

📝 Abstract
Speculative Decoding (SD) ensures that the output matches the target model's distribution exactly. However, we argue that this distribution matching requirement is too stringent and results in unnecessarily low acceptance rates, limiting potential speedups. Instead, we advocate a reformulation of the decoding objective: the proposed decoding strategy should match the expected utility, i.e., the task-specific performance, of the target model. This perspective also aligns better with real-world use cases of LLMs, where utility (e.g., code correctness, factual accuracy) is often more important than sampling distribution. Based on this reformulation, we propose a novel decoding strategy: Pivot-Aware Speculative Decoding, which rejects only those tokens that would lead to a utility drop in the final output. We refer to these critical tokens as pivot tokens. We propose a method for labeling tokens as pivotal or non-pivotal and train a lightweight classifier to detect them. This method can be viewed as a relaxed version of standard SD, which offers much higher acceptance while preserving utility. We evaluate our method across various datasets, demonstrating that we can achieve up to $2.5\times$ speedup with comparable utility. Source code is available at https://github.com/amir-zsh/PAD.
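The decoding loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the draft model, target model, and pivot classifier are replaced by toy stand-ins, and all function names (`draft_propose`, `target_token`, `is_pivot`, `pivot_aware_decode`) are hypothetical. The key difference from standard SD is that non-pivot draft tokens are accepted unconditionally; only tokens the classifier flags as pivotal are rejected and replaced by the target model's choice.

```python
import random

random.seed(0)

# Toy stand-ins; the paper's actual models and classifier differ.
VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a+b"]

def draft_propose(prefix, k):
    """Draft model: cheaply propose k candidate continuation tokens."""
    return [random.choice(VOCAB) for _ in range(k)]

def target_token(prefix):
    """Target model: the token the large model would emit next."""
    return random.choice(VOCAB)

def is_pivot(prefix, token):
    """Lightweight classifier: flag tokens whose rejection is predicted
    to matter for task utility (here: a random 10% for illustration)."""
    return random.random() < 0.1

def pivot_aware_decode(prompt, k=8, max_len=32):
    out = list(prompt)
    while len(out) < max_len:
        for tok in draft_propose(out, k):
            if is_pivot(out, tok):
                # Only pivot tokens are rejected: substitute the target
                # model's token, then restart drafting from here.
                out.append(target_token(out))
                break
            out.append(tok)  # non-pivot tokens are accepted as-is
            if len(out) >= max_len:
                break
    return out[:max_len]
```

Standard SD would instead run a rejection-sampling check on every draft token against the target distribution; here that check is replaced by the (much more permissive) pivot classifier, which is what drives the higher acceptance rate.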
Problem

Research questions and friction points this paper is trying to address.

Improving speculative decoding acceptance rates while maintaining utility
Identifying critical tokens that impact task-specific performance metrics
Achieving faster inference speeds without compromising output quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rejects only critical tokens for utility preservation
Uses lightweight classifier to detect pivotal tokens
Achieves higher speedup while maintaining task performance