🤖 AI Summary
Existing LLM inference serving systems struggle to simultaneously satisfy highly heterogeneous user SLOs (e.g., latency constraints) and maximize system throughput. Method: This paper proposes the first SLO-customized LLM inference serving framework, featuring: (1) a draft-model logits-based mechanism for accurately predicting speculative decoding correctness; (2) a theoretically optimal, fine-grained token-tree construction and pruning algorithm; and (3) an SLO-aware, request-level scheduling policy. The framework enables on-demand SLO specification and runtime adaptive selection of optimal speculative decoding paths. Results: Under mixed multi-SLO workloads, the framework improves SLO compliance by up to 73% and effective throughput (goodput) by up to 74%, significantly outperforming state-of-the-art systems.
📝 Abstract
This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe uses a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
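The core idea behind the speculation-and-selection scheme can be illustrated with a minimal sketch: the draft model's logits are turned into probabilities that serve as a proxy for each candidate token's acceptance likelihood, and tokens are then greedily admitted into a request's verification budget in order of that score. This is not AdaServe's actual algorithm (the paper's tree construction is theoretically optimal and tree-structured); the function names, flat candidate set, and per-request `budget` parameter here are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw draft-model logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_speculative_tokens(candidates, budget):
    """Greedy stand-in for budgeted token selection.

    `candidates` maps a candidate token to its draft-model logit
    (hypothetical interface). The softmax probability is used as a
    proxy for the chance the target model accepts the token; the
    `budget` caps how many speculative tokens this request may
    contribute to the verification batch under its SLO.
    """
    probs = softmax(list(candidates.values()))
    ranked = sorted(zip(candidates.keys(), probs), key=lambda kv: -kv[1])
    return [token for token, _ in ranked[:budget]]

# A request with three candidate continuations and room for two
# speculative tokens keeps the two highest-probability candidates.
picked = select_speculative_tokens({"the": 2.0, "a": 1.0, "an": 0.0}, budget=2)
print(picked)  # → ['the', 'a']
```

A tighter latency SLO would shrink `budget` (less speculation, lower per-step cost), while a throughput-oriented request could speculate more aggressively, which is the trade-off the scheduler balances across requests.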