SMART: When is it Actually Worth Expanding a Speculative Tree?

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the negative wall-clock speedup that existing tree-based speculative decoding methods exhibit under large batch sizes or high hardware utilization, where the overhead of drafting and verifying large trees can grow super-linearly. The authors propose SMART, a framework that formulates speculative tree expansion as a hardware-aware optimization problem. At runtime, SMART expands a node only when its marginal benefit–cost ratio exceeds the current tree-level speedup, which gives training-free, plug-and-play control over tree size. The approach is compatible with existing frameworks such as MSD and EAGLE and delivers an average additional speedup over state-of-the-art baselines of 20.0% on multimodal LLMs and 15.4% on text-only LLMs, across diverse GPU architectures and compute-bound batching regimes, without performance loss.

📝 Abstract

Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical "efficiency paradox": the computational overhead of drafting and verifying big trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit–cost rule at inference time, SMART expands a node only when its marginal benefit–cost ratio exceeds the tree-level speedup. SMART is training-free and serves as a plug-and-play controller for existing frameworks like MSD and EAGLE. Extensive evaluations across three MLLMs (e.g., LLaVA, Qwen2-VL) and four LLMs (e.g., Llama-3.1, DeepSeek-R1) demonstrate that SMART consistently outperforms state-of-the-art baselines. It delivers an average additional speedup of 20.0% for MLLMs and 15.4% for LLMs across compute-bound batching regimes and diverse GPU architectures without performance loss.
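The expansion rule in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that tree-level speedup is modeled as B / C, where B is the expected number of tokens produced per decoding step and C is the step cost normalized to one plain target-model forward pass. Under that assumption, adding a node with marginal benefit ΔB and marginal cost ΔC raises the speedup exactly when ΔB / ΔC > B / C. The names `TreeState`, `should_expand`, `grow_tree`, and the `marginal_cost` callback are hypothetical, as are the acceptance probabilities and cost values used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TreeState:
    # Assumed estimates, not quantities defined by the paper:
    benefit: float  # expected tokens produced per decoding step
    cost: float     # step cost, normalized to one target-model forward pass

    @property
    def speedup(self) -> float:
        return self.benefit / self.cost


def should_expand(state: TreeState, d_benefit: float, d_cost: float) -> bool:
    """Expand only if the marginal benefit/cost ratio beats the current speedup."""
    if d_cost <= 0:
        # A free node can only help if it adds expected tokens.
        return d_benefit > 0
    return d_benefit / d_cost > state.speedup


def grow_tree(
    accept_probs: List[float],
    marginal_cost: Callable[[int], float],
    base_benefit: float = 1.0,  # assumption: each step yields at least one token
    base_cost: float = 1.0,     # assumption: one target forward pass per step
) -> Tuple[List[float], TreeState]:
    """Greedily add candidate nodes, most likely first, while the rule holds.

    accept_probs: hypothetical path-acceptance probability of each candidate.
    marginal_cost(n): hypothetical extra cost of adding a node to a tree of n nodes.
    """
    state = TreeState(base_benefit, base_cost)
    kept: List[float] = []
    for p in sorted(accept_probs, reverse=True):
        dc = marginal_cost(len(kept))
        if not should_expand(state, p, dc):
            # Later candidates are no more likely and cost the same, so stop.
            break
        state = TreeState(state.benefit + p, state.cost + dc)
        kept.append(p)
    return kept, state


if __name__ == "__main__":
    candidates = [0.8, 0.5, 0.1]
    # Three made-up regimes: cheap, moderately expensive, and saturated hardware.
    for label, per_node in [("light", 0.02), ("heavy", 0.3), ("saturated", 1.0)]:
        kept, state = grow_tree(candidates, lambda n, c=per_node: c)
        print(label, len(kept), round(state.speedup, 3))
```

In this toy model, the same candidates produce a smaller tree as the per-node cost rises, and in the saturated regime no node is added at all, which mirrors the abstract's point that larger trees can reduce wall-clock speedup when hardware is saturated. A real controller would need measured or profiled costs for the target hardware and batch size in place of the constant per-node cost used here.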

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

efficiency paradox

tree expansion

computational overhead

wall-clock speedup

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

hardware-aware optimization

marginal analysis