Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a key limitation of conventional speculative decoding: strict token-level matching often rejects draft outputs that are semantically correct but lexically different, which limits the achievable inference speedup. The authors propose Calibrated Speculative Decoding (CSD), a training-free framework built from two lightweight modules. Online Correction Memory aggregates past rejections and proposes recurring divergence patterns as rescue candidates (frequency-guided candidate selection). Semantic Consistency Gating then decides whether a candidate is admissible using probability ratios rather than exact token matching. Evaluated across multiple large language models, the method reaches a peak throughput speedup of 2.33× while preserving task accuracy, and it further improves performance on complex reasoning datasets.

📝 Abstract

Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.
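
The two modules named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how rejections are keyed, how many repetitions make a pattern "recurring", or what ratio threshold is used. The class `CorrectionMemory`, the function `verify_token`, and the parameters `min_count` and `ratio_threshold` are all hypothetical names chosen for illustration, and the rule "accept a rejected draft token if its (draft, target) pair has recurred and its target probability is close enough to the top token's" is an assumption about how the pieces might fit together.

```python
from collections import Counter, defaultdict


class CorrectionMemory:
    """Toy stand-in for an online correction memory (hypothetical design).

    Counts how often a given draft token was rejected in favour of a
    given target token, so recurring divergence patterns can be found.
    """

    def __init__(self, min_count=2):
        # min_count is an assumed cutoff for calling a pattern "recurring".
        self.min_count = min_count
        self.counts = defaultdict(Counter)

    def record(self, draft_token, target_token):
        self.counts[draft_token][target_token] += 1

    def is_recurring(self, draft_token, target_token):
        return self.counts[draft_token][target_token] >= self.min_count


def verify_token(draft_token, target_probs, memory, ratio_threshold=0.5):
    """Return True if the draft token is accepted.

    target_probs maps token -> probability under the target model.
    ratio_threshold is an assumed value, not taken from the paper.
    """
    top_token = max(target_probs, key=target_probs.get)

    # Standard path: exact match with the target model's top token.
    if draft_token == top_token:
        return True

    # Rescue path: only pairs seen often enough become candidates,
    # and the candidate must pass a probability-ratio gate.
    if memory.is_recurring(draft_token, top_token):
        ratio = target_probs.get(draft_token, 0.0) / target_probs[top_token]
        if ratio >= ratio_threshold:
            return True

    # Rejected: remember the pattern for future decoding steps.
    memory.record(draft_token, top_token)
    return False


memory = CorrectionMemory(min_count=2)
probs = {"big": 0.5, "large": 0.4, "cat": 0.1}
decisions = [verify_token("large", probs, memory) for _ in range(3)]
```

In this toy run the draft token "large" is rejected until the (large, big) pair has been recorded `min_count` times, after which the ratio gate (0.4 / 0.5) admits it. A token such as "cat" would stay rejected no matter how often it recurs, because its ratio (0.1 / 0.5) falls below the assumed threshold; in this sketch the probability gate, not the frequency count, has the final say.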

Problem

Research questions and friction points this paper is trying to address.

Speculative Decoding

False Rejection

Autoregressive Generation

Token Verification

Lexical Divergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding

Frequency-Guided Selection

Semantic Consistency Gating