SMART: When is it Actually Worth Expanding a Speculative Tree?

2026-04-09

AI Summary
This work addresses the negative speedup that existing tree-based speculative decoding methods exhibit under large batch sizes or high hardware utilization, where the cost of drafting and verifying a tree can grow superlinearly. The authors propose SMART, a framework that they describe as the first to formulate speculative tree expansion as a hardware-aware optimization problem. At runtime, SMART expands a node only when its marginal benefit–cost ratio exceeds the speedup of the current tree. It requires no training and acts as a plug-and-play controller for existing frameworks such as MSD and EAGLE. The reported average additional speedup over baselines is 20.0% on multimodal large language models and 15.4% on text-only language models, across diverse GPU architectures and compute-bound batched scenarios, without any loss in accuracy.

๐Ÿ“ Abstract
Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical "efficiency paradox": the computational overhead of drafting and verifying large trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit–cost rule at inference time, SMART expands a node only when its marginal benefit–cost ratio exceeds the tree-level speedup. SMART is training-free and serves as a plug-and-play controller for existing frameworks like MSD and EAGLE. Extensive evaluations across three MLLMs (e.g., LLaVA, Qwen2-VL) and four LLMs (e.g., Llama-3.1, DeepSeek-R1) demonstrate that SMART consistently outperforms state-of-the-art baselines. It delivers an average additional speedup of 20.0% for MLLMs and 15.4% for LLMs across compute-bound batching regimes and diverse GPU architectures without performance loss.
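
The expansion rule in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `TreeState`, `should_expand`, and `grow_tree`, the use of a node's path acceptance probability as its marginal benefit, and the `cost_per_node` latency model are all assumptions made for illustration. Speedup is modeled as expected tokens per step divided by step cost, with cost measured in units of one plain autoregressive step. Under that model, adding a node raises the overall ratio exactly when the node's own benefit–cost ratio exceeds the current ratio.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class TreeState:
    """Hypothetical summary of a draft tree (not taken from the paper)."""
    expected_tokens: float  # expected tokens produced per decoding step
    step_cost: float        # step latency, in units of one plain decode step

    @property
    def speedup(self) -> float:
        return self.expected_tokens / self.step_cost


def should_expand(state: TreeState, accept_prob: float, marginal_cost: float) -> bool:
    """Expand only if marginal benefit / marginal cost beats the tree-level speedup."""
    if marginal_cost <= 0:
        # A free node can only help, provided it has any chance of acceptance.
        return accept_prob > 0
    return accept_prob / marginal_cost > state.speedup


def grow_tree(
    accept_probs: Sequence[float],
    base_cost: float,
    cost_per_node: Callable[[int], float],
) -> Tuple[List[float], TreeState]:
    """Greedily add candidate nodes, most promising first.

    Assumes cost_per_node(n), the extra cost of the (n+1)-th node, is
    non-decreasing in n, so stopping at the first rejection is safe.
    """
    # One token is always produced per step, even with an empty tree.
    state = TreeState(expected_tokens=1.0, step_cost=base_cost)
    kept: List[float] = []
    for p in sorted(accept_probs, reverse=True):
        dc = cost_per_node(len(kept))
        if not should_expand(state, p, dc):
            break
        state = TreeState(state.expected_tokens + p, state.step_cost + dc)
        kept.append(p)
    return kept, state


if __name__ == "__main__":
    candidates = [0.9, 0.6, 0.3, 0.05]
    # Cheap extra nodes (e.g. a memory-bound regime) vs. expensive ones
    # (e.g. a compute-bound regime with large batches); both cost figures are made up.
    for label, cost in [("cheap", 0.02), ("expensive", 0.5)]:
        kept, state = grow_tree(candidates, 1.0, lambda n, c=cost: c)
        print(label, len(kept), round(state.speedup, 3))
```

With the same candidates, the sketch keeps fewer nodes when each node is expensive, which mirrors the abstract's point that large trees stop paying off once hardware is saturated. A real controller would need measured latencies for the target hardware rather than a constant per-node cost.
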
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
efficiency paradox
tree expansion
computational overhead
wall-clock speedup
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
hardware-aware optimization
marginal analysis
tree expansion
inference acceleration