TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

📅 2026-01-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing tree-based speculative decoding methods, which rely on fixed draft tree structures and cannot dynamically adapt to contextual difficulty, thereby constraining inference efficiency. We propose a training-free, budget-driven adaptive draft tree expansion framework that dynamically balances exploration breadth and generation depth under a given computational budget. By integrating confidence-aware budget allocation with a hybrid node expansion strategy, our approach achieves, for the first time, training-agnostic adaptive tree structure generation. The method is compatible with existing tree-based speculative decoders and consistently outperforms EAGLE-3 across five models and six datasets, achieving up to a 5.16× speedup over autoregressive decoding in end-to-end inference.
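The summary's "confidence-aware budget allocation" can be illustrated with a small sketch: decide how many candidate tokens one tree layer keeps based on how confident the draft model's distribution is. The entropy-based rule below is an assumption chosen for illustration, not the allocation rule described in the paper.

```python
import math

def layer_width(probs, max_width=8):
    """Pick a branching factor for one draft-tree layer from the draft
    model's token distribution: near-deterministic distributions get
    width 1 (go deep), flat distributions get max_width (go wide).

    Hypothetical rule: scale width by normalized Shannon entropy."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))  # entropy of a uniform distribution
    ratio = entropy / max_entropy if max_entropy > 0 else 0.0
    return max(1, round(ratio * max_width))
```

Under this toy rule, a confident context such as `[0.97, 0.01, 0.01, 0.01]` gets a width near 1, while a flat `[0.25, 0.25, 0.25, 0.25]` gets the full `max_width`, mirroring the deep-and-narrow versus shallow-and-wide behavior the summary describes.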

📝 Abstract
Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to early stop on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
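The abstract's core idea, growing the draft tree node by node until a fixed token budget is spent, can be sketched as a best-first expansion loop. Everything below (the node scoring by cumulative draft confidence, the `draft_probs` callback, the branching cap) is an assumption for illustration, not TALON's actual algorithm; it only shows why budget-driven expansion yields deep-and-narrow trees in deterministic contexts and shallow-and-wide trees in uncertain ones.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                 # negative cumulative confidence (min-heap pops the best node)
    token: int = field(compare=False)
    depth: int = field(compare=False)
    children: list = field(default_factory=list, compare=False)

def expand_tree(root_token, draft_probs, budget=16, branch=4):
    """Grow a draft tree until `budget` nodes exist, always expanding the
    frontier node with the highest cumulative draft confidence.

    draft_probs(token, depth) -> list of (child_token, prob), best first.
    (Hypothetical interface standing in for a real draft model.)"""
    root = Node(neg_score=-1.0, token=root_token, depth=0)
    frontier = [root]                # heap ordered by cumulative confidence
    n_nodes = 1
    while frontier and n_nodes < budget:
        node = heapq.heappop(frontier)
        for tok, p in draft_probs(node.token, node.depth)[:branch]:
            if n_nodes >= budget:
                break
            # Multiply confidences along the path; store negated for the min-heap.
            child = Node(neg_score=node.neg_score * p, token=tok, depth=node.depth + 1)
            node.children.append(child)
            heapq.heappush(frontier, child)
            n_nodes += 1
    return root
```

With a near-deterministic draft distribution this loop keeps extending one dominant chain (deep and narrow), while a flat distribution spreads the same budget across siblings at shallow depths (shallow and wide), matching the trade-off the abstract describes.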
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
adaptive tree expansion
token tree
LLM inference acceleration
budget-constrained generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
adaptive token trees
budget-driven inference
LLM acceleration
tree-based generation