SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a trade-off in existing speculative decoding methods between path dependence and drafting efficiency: autoregressive drafters keep dependence along each draft path but add drafting latency, while parallel drafters are cheap but produce paths the verifier often rejects. The authors propose SpecBlock, a block-iterative drafter that generates path-dependent, tree-structured drafts efficiently through intra-block layer-wise hidden-state shifting and inter-block hidden-state inheritance. SpecBlock also incorporates a co-trained rank head that allocates branching per position, a valid-prefix masked loss, and a cost-aware bandit that updates the drafter online only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8–13% over EAGLE-3 at 44–52% of its drafting cost, and that cost-aware adaptation extends this lead to 11–19%.
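The cost-aware adaptation comes down to a gate: update the drafter only when the expected throughput gain exceeds the update cost. The following is a minimal sketch of such a gate, not the paper's bandit. The class name `UpdateGate`, the exponential moving average, the `smoothing` parameter, and the choice of units are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UpdateGate:
    """Hypothetical gate deciding whether a drafter update is worth its cost."""

    update_cost: float          # assumed cost of one update, e.g. seconds of compute
    smoothing: float = 0.1      # assumed weight given to each new observation
    expected_gain: float = 0.0  # running estimate of the gain from one update

    def observe(self, measured_gain: float) -> None:
        # Exponential moving average over gains measured after past updates.
        self.expected_gain += self.smoothing * (measured_gain - self.expected_gain)

    def should_update(self) -> bool:
        # Update only when the estimated gain exceeds the cost.
        return self.expected_gain > self.update_cost


gate = UpdateGate(update_cost=1.0)
print(gate.should_update())  # False: no evidence of gain yet
gate.observe(20.0)           # a large measured gain, in the same units as the cost
print(gate.should_update())  # True: the estimate now exceeds the cost
```

A gate like this never gathers new evidence while it stays closed, which is the gap a bandit's exploration would fill. The sketch omits that part.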
📝 Abstract
Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
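The valid-prefix mask can be illustrated with a short sketch. It assumes that "wrong" means the drafter's top-1 token differs from the target token, and that the first wrong position still contributes to the loss while every later position is dropped. The function names and the per-position loss values are hypothetical, and the paper's actual training code may differ.

```python
def valid_prefix_mask(predicted, target):
    """Return 1.0 for positions that contribute to the loss, else 0.0.

    A position is kept only if every earlier position in the block was
    predicted correctly. The first wrong position is itself kept, so the
    drafter is still penalised for that mistake.
    """
    mask = []
    prefix_ok = True
    for p, t in zip(predicted, target):
        mask.append(1.0 if prefix_ok else 0.0)
        prefix_ok = prefix_ok and (p == t)
    return mask


def masked_mean_loss(per_position_loss, mask):
    """Average the per-position losses over the kept positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_position_loss, mask)) / kept


# Block of K = 4 positions; the drafter is wrong at the third position.
predicted = [5, 9, 2, 7]
target = [5, 9, 3, 7]
mask = valid_prefix_mask(predicted, target)
print(mask)  # [1.0, 1.0, 1.0, 0.0]

# Illustrative per-position losses; the last one is ignored by the mask.
print(masked_mean_loss([0.1, 0.2, 1.5, 0.4], mask))
```

Under these assumptions, the fourth position is excluded because at inference the drafter would never reach it along this prefix.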
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
drafter
path dependence
LLM inference
drafting efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
block-iterative drafting
path dependence
cost-aware adaptation
dynamic tree drafting