Block Verification Accelerates Speculative Decoding

📅 2024-03-15
📈 Citations: 5
Influential: 0
🤖 AI Summary
Token-level verification in speculative decoding of large language models limits efficiency. Method: This paper proposes a block-level joint verification mechanism that preserves the target model's output distribution exactly, incurring zero accuracy loss while improving verification efficiency. It gives the first proof that the mechanism is optimal in expected token yield per iteration, guaranteeing performance no worse than conventional token-level verification, and designs a parallel verification algorithm grounded in probabilistic consistency analysis that integrates seamlessly into existing speculative decoding frameworks. Contribution/Results: As a verification paradigm combining lossless acceleration, negligible overhead, and strong theoretical guarantees, it achieves stable end-to-end inference speedups of 5%-8% across diverse tasks and datasets, without increasing implementation complexity or sacrificing model quality.

📝 Abstract
Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a guarantee that the output is distributed identically to a sample from the target model. In prior works, draft verification is performed independently token-by-token. Surprisingly, we show that this approach is not optimal. We propose Block Verification, a simple draft verification algorithm that verifies the entire block jointly and provides additional wall-clock speedup. We prove that the proposed mechanism is optimal in the expected number of tokens produced each iteration and specifically is never worse than the standard token-level verification. Empirically, block verification provides modest but consistent wall-clock speedups over the standard token verification algorithm of 5%-8% in a range of tasks and datasets. Given that block verification does not increase code complexity, maintains the strong lossless guarantee of the standard speculative decoding verification algorithm, cannot deteriorate performance, and, in fact, consistently improves it, it can be used as a good default in speculative decoding implementations.
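For context, the baseline this paper improves on is the standard token-level verification rule from the original speculative decoding work: each drafted token x is accepted with probability min(1, p(x)/q(x)), where q and p are the draft and target distributions, and on the first rejection a replacement is sampled from the renormalized residual max(0, p - q). Below is a minimal sketch of that baseline (not the paper's block verification algorithm); the function name and array-based interface are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_token_level(draft_tokens, q_probs, p_probs):
    """Sketch of standard token-level speculative verification.

    draft_tokens: token ids proposed by the draft model.
    q_probs[i], p_probs[i]: draft/target distributions (1-D arrays over
    the vocabulary) at position i. Each draft token x is accepted with
    probability min(1, p[x] / q[x]); at the first rejection a replacement
    is sampled from the renormalized residual max(0, p - q). This yields
    output distributed exactly as samples from the target model.
    """
    accepted = []
    for x, q, p in zip(draft_tokens, q_probs, p_probs):
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(int(x))  # draft token matches target closely enough
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break  # stop at the first rejection
    return accepted
```

Because each position is verified independently here, correlations across the block are ignored; block verification instead couples the accept/reject decisions over the whole block, which is what yields the extra expected tokens per iteration.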
Problem

Research questions and friction points this paper is trying to address.

Optimizing draft verification in speculative decoding
Improving wall-clock speedup via block verification
Ensuring lossless acceleration for large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint block verification for faster decoding
Optimal token production per iteration
Consistent 5%-8% speedup without added complexity
Ziteng Sun (Google Research, New York)
Jae Hun Ro (Google Research, New York)
Ahmad Beirami (Google DeepMind)
A. Suresh (Google Research, New York)
Uri Mendlovic
Yaniv Leviathan
A. Aharoni

Machine Learning · Natural Language Processing · Statistics · Information Theory · Optimization