🤖 AI Summary
Existing speculative decoding methods struggle to efficiently handle the frequent, repetitive, and long-horizon predictable inference requests common in LLM-agent scenarios. This paper proposes SuffixDecoding, a novel, model-agnostic speculative decoding paradigm that operates entirely in CPU memory. Its core innovation is a dynamically maintained suffix tree over historical outputs, coupled with an interpretable, empirically calibrated token-frequency scoring mechanism that enables lightweight tree-based speculation and adaptive pruning. Compared to SpecInfer, SuffixDecoding achieves 1.4× higher throughput and 1.1× lower time-per-output-token (TPOT) latency on open-domain dialogue and code generation; on text-to-SQL tasks, it attains a 2.9× throughput improvement and reduces latency to one-third, while sustaining high acceptance rates even with a small reference corpus of 256 examples. To our knowledge, this is the first speculative decoding framework that eliminates the need for a draft model, runs fully on CPU, and provides human-interpretable speculation decisions.
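The suffix-tree index at the heart of the method is concrete enough to sketch. Below is a minimal, hypothetical Python illustration of how previously generated outputs could be indexed in a token-level suffix tree held in CPU memory; the class and method names, the depth cap, and the count-based bookkeeping are assumptions made for exposition, not the authors' code.

```python
# Hypothetical sketch of the suffix-tree index (names and details assumed,
# not taken from the paper). Tokens are ints; the tree lives in CPU memory.

class Node:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}  # token id -> Node
        self.count = 0      # empirical frequency of the path ending here

class SuffixTree:
    def __init__(self, max_depth=64):
        self.root = Node()
        self.max_depth = max_depth  # depth cap bounds memory and walk time

    def insert(self, tokens):
        """Index every suffix of one generated output, truncated to max_depth."""
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, Node())
                node.count += 1

    def lookup(self, context):
        """Return the node reached by the longest suffix of `context`
        that appears in the tree (the root if nothing matches)."""
        context = context[-self.max_depth:]  # only recent tokens can match
        for start in range(len(context)):
            node = self.root
            matched = True
            for tok in context[start:]:
                child = node.children.get(tok)
                if child is None:
                    matched = False
                    break
                node = child
            if matched:
                return node
        return self.root
```

In this sketch, inserting each completed request's output back into the tree is what would let the system keep improving as more historical outputs accumulate, consistent with the paper's reported behavior.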
📄 Abstract
We present SuffixDecoding, a novel model-free approach to accelerating large language model (LLM) inference through speculative decoding. Unlike existing methods that rely on draft models or specialized decoding heads, SuffixDecoding leverages suffix trees built from previously generated outputs to efficiently predict candidate token sequences. Our approach enables flexible tree-structured speculation without the overhead of maintaining and orchestrating additional models. SuffixDecoding builds and dynamically updates suffix trees to capture patterns in the generated text, using them to construct speculation trees through a principled scoring mechanism based on empirical token frequencies. SuffixDecoding requires only CPU memory, which is plentiful and underutilized on typical LLM serving nodes. We demonstrate that SuffixDecoding achieves competitive speedups compared to model-based approaches across diverse workloads, including open-domain chat, code generation, and text-to-SQL tasks. For open-ended chat and code generation tasks, SuffixDecoding achieves up to $1.4\times$ higher output throughput than SpecInfer and up to $1.1\times$ lower time-per-output-token (TPOT) latency. For a proprietary multi-LLM text-to-SQL application, SuffixDecoding achieves up to $2.9\times$ higher output throughput and $3\times$ lower latency than draft-model-based speculative decoding. Our evaluation shows that SuffixDecoding maintains high acceptance rates even with small reference corpora of 256 examples, while continuing to improve performance as more historical outputs are incorporated.
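To make the frequency-based scoring concrete, here is a hedged sketch of how a speculation tree might be grown from the suffix-tree index above: candidate branches are expanded greedily, each scored by the product of empirical token frequencies along its path. The function name, the `budget` parameter, and the heap-based expansion order are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of speculation-tree construction (not the paper's
# implementation): greedily expand the most frequent continuations, scoring
# each path by the product of empirical token frequencies along it.
import heapq

def build_speculation_tree(tree, context, budget=64):
    node = tree.lookup(context)       # longest-suffix match (sketch above)
    frontier = [(-1.0, 0, [], node)]  # (-score, tiebreak, path, node)
    tiebreak = 1
    candidates = []                   # (speculated token path, score)
    while frontier and len(candidates) < budget:
        neg_score, _, path, cur = heapq.heappop(frontier)
        total = sum(c.count for c in cur.children.values())
        if total == 0:
            continue  # leaf: nothing observed beyond this path
        for tok, child in cur.children.items():
            score = -neg_score * (child.count / total)  # empirical frequency
            new_path = path + [tok]
            candidates.append((new_path, score))
            heapq.heappush(frontier, (-score, tiebreak, new_path, child))
            tiebreak += 1
    # The target LLM would then verify these candidates in one batched
    # forward pass, accepting the longest prefix that matches its own output.
    return candidates
```

In this sketch the score is a crude empirical probability estimate, so low-frequency branches are naturally pruned once the budget is exhausted, which is one plausible reading of the paper's "principled scoring" and adaptive pruning.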