An Interpretable Latency Model for Speculative Decoding in LLM Serving

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of an interpretable model for the latency of speculative decoding under dynamic workloads in real-world LLM serving. It introduces a structured latency analysis framework for time-varying request loads: it uses Little’s Law to infer effective batch size from request rate and decomposes per-request demand into load-independent and load-dependent components across the prefill, drafting, and verification phases. Combining this queueing-theoretic view with empirical measurements from vLLM, the study examines the effect of key parameters, including draft length, acceptance probability, and verifier and drafter model size. The resulting model accurately describes measured latency, explains why speedups diminish as server load increases, and offers guidance for choosing draft length and configuring deployments, with an extension to mixture-of-experts architectures.
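The Little's Law step in the summary above can be illustrated with a minimal sketch. It assumes, purely for illustration, that request latency is linear in the effective batch size (a load-independent cost plus a per-concurrent-request cost); this linear form and the names `fixed_cost` and `per_request_cost` are hypothetical and are not taken from the paper.

```python
def effective_batch_size(rate: float, fixed_cost: float, per_request_cost: float) -> float:
    """Solve Little's Law B = rate * W(B) for the effective batch size B.

    Assumed (hypothetical) latency model: W(B) = fixed_cost + per_request_cost * B,
    where fixed_cost is the load-independent part and per_request_cost * B is the
    load-dependent part. Substituting gives the closed form below.
    """
    utilization = rate * per_request_cost
    if utilization >= 1.0:
        # The fixed point does not exist: requests arrive faster than they finish.
        raise ValueError("request rate too high for a stable operating point")
    return rate * fixed_cost / (1.0 - utilization)


def request_latency(rate: float, fixed_cost: float, per_request_cost: float) -> float:
    """Latency W at the self-consistent batch size under the assumed linear model."""
    batch = effective_batch_size(rate, fixed_cost, per_request_cost)
    return fixed_cost + per_request_cost * batch


if __name__ == "__main__":
    # Illustrative numbers only: 2 requests/s, 1 s fixed cost, 0.25 s per concurrent request.
    b = effective_batch_size(2.0, 1.0, 0.25)
    w = request_latency(2.0, 1.0, 0.25)
    print(f"effective batch size = {b:.2f}, latency = {w:.2f} s")
```

In this toy model the batch size is not an input: it falls out of the request rate and the cost terms, and it grows sharply as `rate * per_request_cost` approaches 1.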

📝 Abstract

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture-of-experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
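To make the roles of draft length and acceptance probability concrete, the sketch below uses the expression commonly found in the speculative-decoding literature for the expected number of tokens produced per verification step, assuming each drafted token is accepted independently with the same probability. It may not match the paper's own formulation, and the timing parameters `draft_step_time` and `verify_step_time` are hypothetical inputs rather than measured values.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per draft-then-verify step.

    Assumes i.i.d. acceptance with probability accept_prob for each of the
    draft_len drafted tokens; the verifier always contributes one token.
    Closed form: (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        # Limit of the closed form: every drafted token is accepted.
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def time_per_token(accept_prob: float, draft_len: int,
                   draft_step_time: float, verify_step_time: float) -> float:
    """Average time per output token for one SD step (hypothetical cost model)."""
    step_time = draft_len * draft_step_time + verify_step_time
    return step_time / expected_tokens_per_step(accept_prob, draft_len)


if __name__ == "__main__":
    # Illustrative numbers only: sweep draft length at a fixed acceptance probability.
    for k in range(0, 6):
        print(k, round(time_per_token(0.8, k, draft_step_time=1.0, verify_step_time=10.0), 3))
```

Under this cost model, a longer draft helps only while the extra expected tokens outweigh the added drafting time. If verification cost grows with load, as the abstract suggests, the best draft length would shift with it.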

Problem

Research questions and friction points this paper is trying to address.

speculative decoding

LLM serving

latency modeling

effective batch size

request load

Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding

latency modeling

LLM serving