Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing block-diffusion speculative decoding methods: they rely on static tree structures and position-wise marginal distributions, so drafts align poorly with the target language model's true generation trajectory, which lowers acceptance rates and limits acceleration. The authors propose a budget-aware adaptive speculative decoding framework that couples a dynamic, query-dependent tree structure with a hardware-aware online expansion strategy. Without modifying the target model, the method balances draft quality against hardware resource constraints through a path-confidence-driven acceptance proxy and an adaptive best-first expansion mechanism. Experiments across diverse GPU architectures and benchmarks show up to a 6.61× speedup over standard autoregressive decoding and a 39% improvement over state-of-the-art block-diffusion baselines, with no additional training or per-setting tuning.

📝 Abstract

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
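The interplay of the three components can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-parameter roofline cost, the top-k branching, and the stop-at-first-non-improvement rule are all assumptions made for illustration. The sketch scores each path by the product of position-wise marginal probabilities, treats the sum of path confidences as a surrogate for expected accepted length, and grows the tree best-first while the estimated tokens-per-unit-latency keeps improving.

```python
import heapq


def roofline_latency(num_tokens, t_mem=1.0, t_per_token=0.05):
    """Hypothetical verification cost: flat (memory-bound) until compute dominates."""
    return max(t_mem, t_per_token * num_tokens)


def expand_tree(marginals, top_k=2, max_nodes=64, t_draft=0.2,
                latency_fn=roofline_latency):
    """Grow a draft tree best-first under a latency budget (illustrative only).

    marginals[d] maps token id -> probability at draft position d. The same
    marginal is reused for every prefix, mirroring position-wise drafting.
    Returns (selected, expected_len), where selected is a list of
    (token_path, path_confidence) pairs.
    """
    selected = []
    expected_len = 1.0  # assumption: verification always yields one token
    frontier = []       # max-heap on path confidence via negated keys

    def push_children(prefix, conf):
        depth = len(prefix)
        if depth >= len(marginals):
            return
        best = sorted(marginals[depth].items(), key=lambda kv: -kv[1])[:top_k]
        for tok, p in best:
            heapq.heappush(frontier, (-(conf * p), prefix + (tok,)))

    push_children((), 1.0)
    while frontier and len(selected) < max_nodes:
        neg_conf, path = heapq.heappop(frontier)
        conf = -neg_conf
        n = len(selected)
        # Verified tokens = tree nodes plus the root token.
        current = expected_len / (t_draft + latency_fn(n + 1))
        candidate = (expected_len + conf) / (t_draft + latency_fn(n + 2))
        if candidate <= current:
            break  # marginal gain no longer justifies the added verification cost
        selected.append((path, conf))
        expected_len += conf
        push_children(path, conf)
    return selected, expected_len


if __name__ == "__main__":
    toy = [{1: 0.9, 2: 0.1}, {3: 0.8, 4: 0.2}]
    tree, exp_len = expand_tree(toy)
    for path, conf in tree:
        print(path, round(conf, 3))
    print("surrogate expected accepted length:", round(exp_len, 3))
```

With a flat cost region (the default here), every candidate with non-zero confidence is worth adding; once the cost grows with tree size, low-confidence branches are pruned. Stopping at the first non-improving node is a greedy heuristic chosen for brevity, and an online estimator would refit the latency parameters from measured verification times rather than fixing them.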

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

block diffusion

tree-structured drafting

autoregressive decoding

token generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

tree-structured drafting

budget-aware inference