🤖 AI Summary
To address the challenge of simultaneously achieving low latency and strict service-level objective (SLO) compliance for large language model (LLM) inference under dynamic request patterns, this paper proposes an adaptive speculative decoding framework. The framework jointly optimizes draft generation, verification pruning, and lightweight model scheduling by modeling system state and workload dynamics in real time. It introduces the first generalizable theoretical model for predicting speculative efficiency and designs a hardware- and load-aware online feedback control mechanism. Evaluated on realistic request traces, the approach consistently meets SLOs while delivering 1.14×–14.3× speedups over state-of-the-art speculative inference systems. It breaks the traditional performance–stability trade-off, providing the first adaptive speculative decoding solution for dynamic LLM serving with both rigorous theoretical grounding and demonstrated engineering efficacy.
📝 Abstract
Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14$\times$–14.3$\times$ speedups over state-of-the-art speculative inference systems.
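The draft-then-verify loop that the abstract describes can be sketched as follows. This is a minimal greedy illustration, not SpecServe's actual algorithm: `draft_next` and `target_next` are hypothetical stand-ins for the lightweight drafter and the target LLM, and a real system would score all drafted positions in a single batched forward pass over logits rather than calling the target model token by token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One speculative decoding step (greedy, illustrative):
    draft k tokens with the cheap model, verify them with the
    target model, and return the tokens accepted this step."""
    # 1. Drafting: the lightweight model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. Verification: check each drafted token against the target model's
    # own greedy choice (real systems verify all k positions in one pass).
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token == expected:
            accepted.append(token)
            ctx.append(token)
        else:
            # First mismatch: keep the target model's token and stop,
            # so the output is identical to plain target-only decoding.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted: the target model contributes a bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the drafter agrees with the target for all k positions, one step yields k+1 tokens for a single target pass; the fewer the mismatches, the larger the speedup, which is why adapting the draft length k to the current workload matters.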