🤖 AI Summary
To address the low inference efficiency of large language models (LLMs) and the limitations of existing speculative decoding methods—particularly in budget scalability, hyperparameter robustness, and hardware adaptability—this paper proposes Sequoia, a scalable, robust, and hardware-aware speculative decoding framework. The key contributions are: (1) a dynamic programming algorithm for constructing optimal speculative token trees; (2) a novel sampling and verification method that remains robust across decoding temperatures and other generation hyperparameters; and (3) a hardware-aware tree optimizer that automatically selects the token tree size and depth for a given hardware platform. Experiments demonstrate substantial speedups: up to 4.04×, 3.73×, and 2.27× on an A100 for Llama2-7B, Llama2-13B, and Vicuna-33B, respectively; and, in the offloading setting on an L40, exact Llama2-70B inference at 0.56 s/token, up to 19.5× faster than Huggingface Accelerate.
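The hardware-aware tree optimizer can be pictured as a search over candidate tree sizes and depths that maximizes accepted tokens per unit of wall-clock time. The sketch below is illustrative only: the function names and the toy latency/acceptance models are assumptions for exposition, not Sequoia's actual cost model, which is fit to measured hardware behavior.

```python
# Illustrative sketch of a hardware-aware tree-size search. The names and
# the toy cost models below are assumptions, not Sequoia's implementation.

def pick_tree_config(sizes, depths, accepted, step_time):
    """Pick (size, depth) maximizing accepted tokens per second of decoding."""
    best, best_rate = None, 0.0
    for n in sizes:
        for d in depths:
            rate = accepted(n, d) / step_time(n, d)
            if rate > best_rate:
                best, best_rate = (n, d), rate
    return best

# Toy models: acceptance saturates with the speculation budget, and the
# verification step has a flat (memory-bound) region before growing with
# tree size once the GPU becomes compute-bound.
accepted = lambda n, d: min(d + 1, 1 + 0.8 * (n ** 0.5))
step_time = lambda n, d: 0.02 * d + 0.01 * max(1.0, n / 64)

print(pick_tree_config([16, 64, 256, 1024], [4, 8, 16], accepted, step_time))
# → (64, 4): the largest tree is not optimal once verification cost grows
```

The key point the search captures is that the best tree is hardware-dependent: on a bandwidth-rich GPU the flat region of `step_time` is wide and large trees pay off, while on slower hardware the optimum shifts to smaller, shallower trees.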
📝 Abstract
As the usage of large language models (LLMs) grows, performing efficient inference with these models becomes increasingly important. While speculative decoding has recently emerged as a promising direction for speeding up inference, existing methods are limited in their ability to scale to larger speculation budgets and to adapt to different hyperparameters and hardware. This paper introduces Sequoia, a scalable, robust, and hardware-aware algorithm for speculative decoding. To attain better scalability, Sequoia introduces a dynamic programming algorithm to find the optimal tree structure for the speculated tokens. To achieve robust speculative performance, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Finally, Sequoia introduces a hardware-aware tree optimizer that maximizes speculative performance by automatically selecting the token tree size and depth for a given hardware platform. Evaluation shows that Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 by up to $4.04\times$, $3.73\times$, and $2.27\times$, respectively. In the offloading setting on an L40, Sequoia achieves as low as 0.56 s/token for exact Llama2-70B inference, a $9.96\times$ speedup over our optimized offloading system (5.6 s/token), $9.7\times$ over DeepSpeed-Zero-Inference, and $19.5\times$ over Huggingface Accelerate.
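The dynamic programming step for tree construction can be sketched as follows. This is a minimal illustration under a simplifying assumption: each node's children are ranked, and `p[k]` gives the acceptance probability of the `(k+1)`-th-ranked speculated child, independent across nodes. The function names and the example probabilities are hypothetical, not taken from the paper.

```python
from functools import lru_cache

# Hypothetical position-wise acceptance probabilities: p[k] is the chance
# the (k+1)-th-ranked speculated child of a node is accepted (descending).
p = [0.8, 0.5, 0.2, 0.1]

def best_tree_score(n, probs):
    """Expected number of accepted tokens for the best token tree of n nodes."""
    K = len(probs)

    @lru_cache(maxsize=None)
    def F(m):
        # The (sub)tree root is always kept; distribute the remaining m-1
        # nodes among up to K ranked child slots.
        if m <= 0:
            return 0.0
        return 1.0 + G(m - 1, K)

    @lru_cache(maxsize=None)
    def G(m, k):
        # Best score of a forest filling child slots 1..k with m nodes total
        # (a knapsack-style split of the node budget).
        if m == 0 or k == 0:
            return 0.0
        best = G(m, k - 1)  # leave slot k empty
        for used in range(1, m + 1):
            best = max(best, probs[k - 1] * F(used) + G(m - used, k - 1))
        return best

    return F(n)

print(best_tree_score(1, p))  # → 1.0: a lone root is always accepted
print(best_tree_score(8, p))  # expected accepted tokens grow sublinearly
```

The memoized recurrence makes the score of the optimal tree computable in polynomial time in the budget, which is what lets the method scale to large speculation budgets rather than relying on a fixed hand-designed tree shape.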