AI Summary
This work addresses the negative speedup that existing tree-based speculative decoding methods exhibit under large batch sizes or high hardware utilization, where the computational overhead of drafting and verification grows superlinearly. To overcome this limitation, the authors propose SMART, a framework that, for the first time, formulates speculative tree expansion as a hardware-aware optimization problem. At runtime, SMART decides whether to expand each node based on its marginal benefit-cost ratio, which gives efficient, training-free, plug-and-play control over tree construction. The approach is compatible with existing frameworks such as MSD and EAGLE, and achieves average additional inference speedups of 20.0% on multimodal large language models and 15.4% on text-only language models across diverse GPU architectures and compute-bound batched scenarios, without any loss in accuracy.
Abstract
Tree-based speculative decoding accelerates autoregressive generation by verifying a branching tree of draft tokens in a single target-model forward pass. However, existing methods prioritize maximizing token-level likelihood or the number of accepted tokens while ignoring a critical "efficiency paradox": the computational overhead of drafting and verifying large trees can grow super-linearly, particularly at scale. This often leads to negative wall-clock speedup when batch sizes increase or hardware saturation limits are reached. To address this, we propose SMART, a system-aware marginal analysis framework for runtime tree construction. SMART reformulates tree expansion as a hardware-aware optimization problem that directly maximizes end-to-end speedup. By applying a principled marginal benefit-cost rule at inference time, SMART expands a node only when its marginal benefit-cost ratio exceeds the tree-level speedup. SMART is training-free and serves as a plug-and-play controller for existing frameworks like MSD and EAGLE. Extensive evaluations across three MLLMs (e.g., LLaVA, Qwen2-VL) and four LLMs (e.g., Llama-3.1, DeepSeek-R1) demonstrate that SMART consistently outperforms state-of-the-art baselines. It delivers an average additional speedup of 20.0% for MLLMs and 15.4% for LLMs across compute-bound batching regimes and diverse GPU architectures without performance loss.
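The marginal benefit-cost rule described above can be illustrated with a minimal Python sketch. It is not SMART's implementation: the `Candidate` type, the `marginal_cost` model, and all constants are hypothetical, chosen only to mimic a cost that stays flat until the hardware saturates and then rises. The sketch assumes that speedup is approximated by expected tokens per step divided by step cost, with cost measured in units of one plain decoding step. Under that assumption, adding a node raises the ratio exactly when its marginal benefit divided by its marginal cost exceeds the current tree-level ratio.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A draft node that could be added to the tree (hypothetical type)."""
    name: str
    accept_prob: float  # estimated probability that the path to this node is accepted


def marginal_cost(tree_size: int, batch_size: int, capacity: int = 64,
                  base: float = 0.01, slope: float = 0.02) -> float:
    """Toy cost of adding one more node, in units of one plain decode step.

    Assumption: cost is flat while batch_size * (tree_size + 1) tokens fit
    under `capacity`, and grows with the overflow once that limit is passed.
    """
    tokens = batch_size * (tree_size + 1)
    overflow = max(0, tokens - capacity)
    return base + slope * overflow / capacity


def build_tree(candidates, batch_size, base_cost=1.0):
    """Greedily add nodes while the marginal ratio beats the tree-level ratio."""
    benefit = 1.0     # one token is always produced per verification step
    cost = base_cost  # cost of a step with no draft nodes
    chosen = []
    # Visit the most promising nodes first; costs never fall as the tree grows,
    # so the first rejected candidate implies all later ones are rejected too.
    for cand in sorted(candidates, key=lambda c: c.accept_prob, reverse=True):
        d_cost = marginal_cost(len(chosen), batch_size)
        # Expand only if delta_benefit / delta_cost > benefit / cost.
        if cand.accept_prob / d_cost <= benefit / cost:
            break
        benefit += cand.accept_prob
        cost += d_cost
        chosen.append(cand.name)
    return chosen, benefit / cost


if __name__ == "__main__":
    probs = [0.8, 0.6, 0.4, 0.2, 0.1, 0.05]
    nodes = [Candidate(f"n{i}", p) for i, p in enumerate(probs)]
    for bs in (1, 64):
        tree, speedup = build_tree(nodes, batch_size=bs)
        print(f"batch={bs}: kept {len(tree)} nodes, estimated speedup {speedup:.2f}")
```

With this toy cost model, the same candidate set yields a full tree at batch size 1 and a smaller tree at batch size 64, which is the qualitative behavior the abstract attributes to hardware-aware expansion. A real controller would replace `marginal_cost` with measured or profiled latencies for the target hardware.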