🤖 AI Summary
Existing speculative decoding relies on layer-wise, token-level verification, resulting in short acceptance lengths and wasted candidate paths: (i) single-token probabilities fail to accurately reflect sequence-level distributions; and (ii) top-down verification discards entire subtrees upon rejection of any parent node. This work proposes a novel leaf-to-root, path-level verification paradigm that jointly models the probability of an entire candidate path and enables sequence-level parallel verification via tree-structured computation. The authors provide a theoretical proof that the method exactly reproduces the target model's distribution, eliminating the approximation error inherent in conventional verification frameworks. Experiments across multiple large language models and tasks demonstrate substantial improvements in average acceptance length and throughput, achieving inference acceleration without compromising distributional fidelity.
📝 Abstract
Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify the drafted tokens in parallel to determine whether they should be accepted or rejected. To enhance acceptance rates, existing frameworks typically construct token trees containing multiple candidates at each timestep. However, their reliance on token-level verification mechanisms introduces two critical limitations. First, the probability distribution of a sequence differs from that of individual tokens, leading to suboptimal acceptance lengths. Second, current verification schemes begin from the root node and proceed layer by layer in a top-down manner; once a parent node is rejected, all of its descendant nodes must be discarded, resulting in inefficient utilization of speculative candidates. This paper introduces Traversal Verification, a novel speculative decoding algorithm that fundamentally rethinks the verification paradigm through leaf-to-root traversal. Our approach considers the acceptance of the entire token sequence from the current node to the root, and preserves potentially valid subsequences that would be prematurely discarded by existing methods. We theoretically prove that the probability distribution obtained through Traversal Verification is identical to that of the target model, guaranteeing lossless inference while achieving substantial acceleration gains. Experimental results across different large language models and multiple tasks show that our method consistently improves acceptance length and throughput over existing methods.
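To make the limitation concrete, the conventional token-level scheme the abstract critiques can be sketched as follows. This is a toy illustration of the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution), not the paper's Traversal Verification; the per-position probability dicts and the `rng` hook are hypothetical stand-ins for real model outputs.

```python
import random

def verify_chain_top_down(tokens, p_target, q_draft, rng=random.random):
    """Token-level, top-down verification of one drafted chain.

    tokens   : list of drafted token ids along a root-to-leaf path
    p_target : per-position dict {token_id: prob} under the target model (toy)
    q_draft  : per-position dict {token_id: prob} under the draft model (toy)

    Each token is accepted independently with prob min(1, p/q). On the
    first rejection we stop: every later token (the whole subtree below
    this node, in the tree setting) is discarded, which is the waste of
    candidates that leaf-to-root path-level verification aims to avoid.
    Returns the number of accepted tokens.
    """
    accepted = 0
    for i, tok in enumerate(tokens):
        p = p_target[i].get(tok, 0.0)
        q = q_draft[i].get(tok, 0.0)
        if q > 0.0 and rng() < min(1.0, p / q):
            accepted += 1
        else:
            break  # rejection here discards all remaining drafted tokens
    return accepted
```

For example, a token whose target probability is much lower than its draft probability is likely to be rejected, and everything drafted after it is thrown away even if those later tokens would have been accepted on their own; this is the inefficiency that motivates verifying whole paths from leaf to root.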