🤖 AI Summary
To address the challenge of simultaneously achieving low latency and strict service-level objective (SLO) compliance for large language model (LLM) inference under dynamic request patterns, this paper proposes an adaptive speculative decoding framework. The framework jointly optimizes draft generation, verification pruning, and lightweight model scheduling by modeling system state and workload dynamics in real time. It introduces the first generalizable theoretical model for predicting speculative efficiency and designs a hardware- and load-aware online feedback control mechanism. Evaluated on realistic request traces, the approach consistently meets SLOs while delivering 1.14×–14.3× speedups over state-of-the-art speculative inference systems. It breaks the traditional performance–stability trade-off, providing the first adaptive speculative decoding solution for dynamic LLM serving with both rigorous theoretical grounding and demonstrated engineering efficacy.
📝 Abstract
Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to varying workloads and system environments, resulting in performance variability and SLO violations. In this paper, we introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. SpecServe proposes a theoretical model to understand and predict the efficiency of speculative decoding across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to guarantee optimal performance while achieving high SLO attainment. Experimental results on real-world LLM traces demonstrate that SpecServe consistently meets SLOs and achieves substantial performance improvements, yielding 1.14$\times$–14.3$\times$ speedups over state-of-the-art speculative inference systems.
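The draft-then-verify loop that the abstract describes can be sketched as follows. This is a minimal greedy illustration, not SpecServe's actual algorithm: `draft_next` and `target_next` are hypothetical stand-ins for the lightweight drafter and the target LLM, and a real system would score all drafted positions in a single batched forward pass over logits rather than calling the target model token by token.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One speculative decoding step (greedy, illustrative):
    draft k tokens with the cheap model, verify them with the
    target model, and return the tokens accepted this step."""
    # 1. Drafting: the lightweight model proposes k tokens autoregressively.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. Verification: check each drafted token against the target model's
    # own greedy choice (real systems verify all k positions in one pass).
    accepted = []
    ctx = list(prefix)
    for token in draft:
        expected = target_next(ctx)
        if token == expected:
            accepted.append(token)
            ctx.append(token)
        else:
            # First mismatch: keep the target model's token and stop,
            # so the output is identical to plain target-only decoding.
            accepted.append(expected)
            break
    else:
        # All k drafts accepted: the target model contributes a bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the drafter agrees with the target for all k positions, one step yields k+1 tokens for a single target pass; the fewer the mismatches, the larger the speedup, which is why adapting the draft length k to the current workload matters.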