AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

📅 2025-01-21
🤖 AI Summary
Problem: Existing LLM inference serving systems struggle to satisfy highly heterogeneous user SLOs (e.g., latency constraints) while also maximizing system throughput. Method: This paper proposes AdaServe, which it describes as the first SLO-customized LLM inference serving framework. It features (1) a mechanism that uses the draft model's logits to predict the accuracy of speculated tokens; (2) a theoretically optimal, fine-grained algorithm for constructing and pruning token trees; and (3) an SLO-aware, request-level scheduling policy. Together these let each request specify its own SLO and let the system adaptively select speculative decoding paths at runtime. Results: Under mixed multi-SLO workloads, AdaServe achieves up to 73% higher SLO attainment and up to 74% higher effective throughput (goodput) than state-of-the-art systems.

📝 Abstract
This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
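The speculation-and-selection idea can be illustrated with a minimal Python sketch. It is not AdaServe's actual algorithm: it assumes a hypothetical draft-model interface (`DraftFn`) that maps a token prefix to next-token probabilities, treats the product of draft probabilities along a path as a proxy for that path's acceptance probability, and expresses each request's SLO as a hypothetical `need` value (the expected number of accepted tokens required in this step). The function names, the `need` parameter, and the two-phase greedy selection are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]
# Hypothetical draft-model interface: token prefix -> {next token: probability}.
DraftFn = Callable[[Path], Dict[str, float]]


def build_candidate_tree(draft: DraftFn, budget: int) -> List[Tuple[Path, float]]:
    """Best-first expansion: return up to `budget` nodes in order of
    non-increasing path probability. Every prefix of the returned list is
    itself a valid tree, because a child is only reachable after its parent."""
    # heapq is a min-heap, so store negated probabilities.
    frontier = [(-p, (tok,)) for tok, p in draft(()).items()]
    heapq.heapify(frontier)
    tree: List[Tuple[Path, float]] = []
    while frontier and len(tree) < budget:
        neg_p, path = heapq.heappop(frontier)
        tree.append((path, -neg_p))
        for tok, p in draft(path).items():
            # neg_p * p == -(path probability * next-token probability)
            heapq.heappush(frontier, (neg_p * p, path + (tok,)))
    return tree


def select_tokens(
    candidates: Dict[str, List[Tuple[Path, float]]],
    need: Dict[str, float],
    budget: int,
) -> Dict[str, int]:
    """Return how many leading candidate nodes each request gets verified."""
    taken = {r: 0 for r in candidates}
    # Phase 1: give each request enough nodes to reach its SLO-derived need.
    for r, nodes in candidates.items():
        expected = 0.0
        while budget > 0 and taken[r] < len(nodes) and expected < need.get(r, 0.0):
            expected += nodes[taken[r]][1]
            taken[r] += 1
            budget -= 1
    # Phase 2: spend leftover budget on the most probable remaining nodes.
    heap = [(-nodes[taken[r]][1], r)
            for r, nodes in candidates.items() if taken[r] < len(nodes)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, r = heapq.heappop(heap)
        taken[r] += 1
        budget -= 1
        if taken[r] < len(candidates[r]):
            heapq.heappush(heap, (-candidates[r][taken[r]][1], r))
    return taken


def toy_draft(prefix: Path) -> Dict[str, float]:
    """Toy stand-in for a draft model: two tokens, depth limited to 2."""
    return {} if len(prefix) >= 2 else {"a": 0.7, "b": 0.3}


tree = build_candidate_tree(toy_draft, budget=3)
shares = select_tokens({"tight": tree, "loose": tree},
                       need={"tight": 1.0, "loose": 0.0}, budget=3)
```

In this toy run, the request with the tighter requirement receives its nodes first, and whatever verification budget remains goes to the most probable remaining nodes across all requests, which is the throughput-oriented part of the trade-off.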
Problem

Research questions and friction points this paper is trying to address.

SLO-Oriented Design
Language Model Services
Scalability and Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaServe
Fine-Grained Speculative Decoding
Optimal Token Tree Construction for Verification