🤖 AI Summary
Existing LLM inference serving systems struggle to simultaneously satisfy highly heterogeneous user SLOs (e.g., latency constraints) and maximize system throughput. Method: This paper proposes the first SLO-customized LLM inference serving framework, featuring: (1) a draft-model logits-based mechanism for accurately predicting speculative decoding correctness; (2) a theoretically optimal, fine-grained token-tree construction and pruning algorithm; and (3) an SLO-aware, request-level scheduling policy. The framework enables on-demand SLO specification and runtime adaptive selection of optimal speculative decoding paths. Results: Under mixed multi-SLO workloads, the framework improves SLO compliance by up to 73% and effective throughput (goodput) by up to 74%, significantly outperforming state-of-the-art systems.
📝 Abstract
This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe uses a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
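The core idea behind the speculation-and-selection scheme can be illustrated with a minimal sketch: the draft model's logits are turned into probabilities that serve as a proxy for each candidate token's acceptance likelihood, and tokens are then greedily admitted into a request's verification budget in order of that score. This is not AdaServe's actual algorithm (the paper's tree construction is theoretically optimal and tree-structured); the function names, flat candidate set, and per-request `budget` parameter here are illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw draft-model logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_speculative_tokens(candidates, budget):
    """Greedy stand-in for budgeted token selection.

    `candidates` maps a candidate token to its draft-model logit
    (hypothetical interface). The softmax probability is used as a
    proxy for the chance the target model accepts the token; the
    `budget` caps how many speculative tokens this request may
    contribute to the verification batch under its SLO.
    """
    probs = softmax(list(candidates.values()))
    ranked = sorted(zip(candidates.keys(), probs), key=lambda kv: -kv[1])
    return [token for token, _ in ranked[:budget]]

# A request with three candidate continuations and room for two
# speculative tokens keeps the two highest-probability candidates.
picked = select_speculative_tokens({"the": 2.0, "a": 1.0, "an": 0.0}, budget=2)
print(picked)  # → ['the', 'a']
```

A tighter latency SLO would shrink `budget` (less speculation, lower per-step cost), while a throughput-oriented request could speculate more aggressively, which is the trade-off the scheduler balances across requests.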