🤖 AI Summary
Speculative decoding faces a verification bottleneck due to the intractability of the joint distribution, making it challenging to simultaneously achieve high inference speed and distributional fidelity. This work proposes Hierarchical Speculative Decoding (HSD), which introduces the first provably lossless sequence-level verification mechanism. By hierarchically calibrating probabilities across accessible branches, HSD balances redundant and missing probability mass, thereby circumventing the intractability of the joint distribution. The method is both theoretically interpretable and broadly applicable across diverse speculative decoding frameworks. Experimental results demonstrate that HSD significantly improves token acceptance rates across multiple model families and benchmarks. When integrated into EAGLE-3, it achieves over a 12% gain in inference efficiency, establishing state-of-the-art decoding performance without compromising distributional fidelity.
📝 Abstract
Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.