SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding

📅 2025-03-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To achieve both low inference latency and strict service-level objective (SLO) compliance for large language model (LLM) serving under dynamic request patterns, this paper proposes an adaptive speculative decoding framework. The framework jointly optimizes draft generation, verification pruning, and lightweight-model scheduling through real-time modeling of system state and workload dynamics. It introduces a generalizable theoretical model for predicting speculative decoding efficiency and designs a hardware- and load-aware online feedback control mechanism. Evaluated on realistic request traces, the approach achieves 100% SLO attainment while delivering 1.14×–14.3× speedups over state-of-the-art speculative inference systems, breaking the traditional performance–stability trade-off and providing an adaptive speculative decoding solution for dynamic LLM serving with both theoretical grounding and demonstrated engineering efficacy.

📝 Abstract
Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14×–14.3× speedups over state-of-the-art speculative inference systems.
Problem

Research questions and friction points this paper is trying to address.

Achieving low inference latency in LLM services
Adapting speculative decoding to dynamic workloads
Ensuring high SLO attainment in LLM inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic speculative decoding adaptation
Theoretical model for efficiency prediction
Intelligent drafting and verification algorithms
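The adaptation idea above can be illustrated with the standard expected-accepted-tokens model for speculative decoding (a minimal sketch under simplifying assumptions, not SpecServe's actual algorithm): if each draft token is accepted independently with probability α, a draft of length γ yields (1 − α^(γ+1)) / (1 − α) tokens per verification step in expectation, and a serving system can pick γ to maximize tokens per unit cost given draft and verify costs. The function names and cost model here are hypothetical.

```python
# Illustrative sketch, NOT SpecServe's model: pick a draft length that
# maximizes estimated tokens per unit time, assuming an i.i.d. per-token
# acceptance rate alpha and fixed per-token draft / per-step verify costs.

def expected_accepted_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification step when each of the
    gamma draft tokens is accepted independently with probability alpha
    (geometric series: (1 - alpha^(gamma+1)) / (1 - alpha))."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def best_draft_length(alpha: float, c_draft: float, c_verify: float,
                      max_gamma: int = 8) -> int:
    """Choose the gamma in [1, max_gamma] maximizing expected tokens per
    cost, where one speculative step costs gamma * c_draft + c_verify."""
    def tokens_per_cost(g: int) -> float:
        return expected_accepted_tokens(alpha, g) / (g * c_draft + c_verify)
    return max(range(1, max_gamma + 1), key=tokens_per_cost)

if __name__ == "__main__":
    # High acceptance rate favors long drafts; low acceptance favors short ones.
    print(best_draft_length(alpha=0.9, c_draft=1.0, c_verify=10.0))  # longer draft
    print(best_draft_length(alpha=0.3, c_draft=1.0, c_verify=10.0))  # short draft
```

An adaptive system in the spirit of the paper would re-estimate α and the costs online from recent requests and system load, then re-solve this choice per step.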
Kaiyu Huang
Tongji University, Shenzhen Research Institute of Big Data
Hao Wu
Huazhong University of Science and Technology
Zhubo Shi
Tongji University
Han Zou
Meta
Multimodal AI
Minchen Yu
The Chinese University of Hong Kong, Shenzhen
cloud computing, serverless computing, big data systems, machine learning systems
Qingjiang Shi
Tongji University, Shenzhen Research Institute of Big Data