Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

📅 2025-05-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) suffer from high inference latency due to autoregressive decoding; existing speculative decoding verification methods overemphasize distributional consistency while neglecting semantic correctness and exhibit poor generalization. Method: We propose a semantic-aware, reflective verification method that requires no additional training or model modifications. Leveraging prompt-guided self-reflection, it enables an LLM to concurrently generate both the original and self-reflective token distributions in a single forward pass, thereby performing semantic-level validation of draft tokens. Contribution/Results: This is the first work to integrate LLMs’ intrinsic reflective capability into speculative decoding verification—jointly ensuring distributional fidelity and semantic plausibility—and supports orthogonal integration with statistical validation. Experiments demonstrate significantly increased draft acceptance length, 5–15% faster decoding speed, and preserved generation quality across diverse benchmarks and model scales.

📝 Abstract
Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel during verification. Using prompt-based probing, we obtain both the original and reflective distributions of draft tokens in a single forward pass. The fusion of these distributions enables semantic-level verification of draft tokens that incorporates both consistency and correctness. Experiments across multiple domain benchmarks and model scales demonstrate that our method significantly increases the acceptance length of draft tokens without compromising model performance. Furthermore, we find that the proposed Reflective Verification is orthogonal to existing statistical verification methods, and their combination yields an additional 5–15% improvement in decoding speed.
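For context, the statistical verification baseline the abstract contrasts against can be sketched as follows. This is a minimal illustrative sketch of standard speculative-sampling verification, not the paper's code; the function name `verify_drafts`, the list-based distributions, and the `1e-12` division floor are assumptions for the example.

```python
import random

random.seed(0)

def verify_drafts(draft_tokens, q_dists, p_dists):
    """Standard speculative-sampling verification (illustrative sketch).

    draft_tokens: token ids proposed by the lightweight draft model.
    q_dists[i]:   draft-model distribution at position i (list of probs).
    p_dists[i]:   target-model distribution at position i (list of probs).
    Returns the number of accepted draft tokens (the acceptance length).
    """
    accepted = 0
    for tok, q, p in zip(draft_tokens, q_dists, p_dists):
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        if random.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
            accepted += 1
        else:
            break  # first rejection ends the verified prefix
    return accepted
```

The acceptance test depends only on the two probability values at each position, which is exactly the distributional-consistency criterion the paper argues is too strict on its own: a semantically valid draft token can still be rejected whenever the target model happens to assign it low probability.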
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference latency in auto-regressive LLM decoding
Improving semantic correctness in speculative draft token verification
Enhancing speedup potential without compromising model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free semantics-aware reflective verification
Single-pass fusion of original and reflective distributions
Orthogonal combination with statistical verification methods
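A minimal sketch of the fusion idea in the bullets above: the original and reflective target distributions (both obtained in one forward pass per the paper) are blended before the usual acceptance test. The linear fusion rule, the weight `lam`, and the function name `fused_accept` are illustrative assumptions, not the paper's exact formulation.

```python
import random

random.seed(0)

def fused_accept(tok, q, p_orig, p_reflect, lam=0.5):
    """Speculative acceptance test on a fused target distribution.

    tok:       draft token id to verify.
    q:         draft-model distribution (list of probs).
    p_orig:    target model's original distribution.
    p_reflect: target model's reflective (prompt-probed) distribution.
    lam:       fusion weight -- an assumed hyperparameter, not from the paper.
    """
    # Blend the two target-side distributions (assumed linear fusion).
    p_fused = [lam * a + (1.0 - lam) * b for a, b in zip(p_orig, p_reflect)]
    # Standard speculative test against the fused distribution.
    ratio = p_fused[tok] / max(q[tok], 1e-12)
    return bool(random.random() < min(1.0, ratio))
```

Under this sketch, a draft token that the original distribution underweights can still be accepted if the reflective distribution judges it semantically plausible, which is how fusion can lengthen the accepted prefix without retraining either model.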
👥 Authors
Yixuan Wang
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Yijun Liu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Shiyu Ji
University of California, Santa Barbara
Information Retrieval · Privacy · Security
Yuzhuang Xu
Tsinghua University
Natural Language Processing · Efficient AI · Machine Learning
Yang Xu
Research Center for Social Computing and Interactive Robotics, Harbin Institute of Technology, China
Qingfu Zhu
Harbin Institute of Technology
NLP · Code LLM
Wanxiang Che
Professor, Harbin Institute of Technology
Natural Language Processing