Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency

📅 2025-05-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In speculative decoding of large language models (LLMs), dynamic token acceptance rates make inference latency hard to estimate, leading to inefficient request scheduling. To address this, the paper proposes LAPS-SD (Least-Attained/Perceived-Service for Speculative Decoding), a semi-clairvoyant scheduling algorithm. While a request's acceptance rate is still fluctuating and its execution time cannot be estimated reliably, LAPS-SD schedules it across multiple priority queues in a least-attained-service fashion, allowing execution preemption between queues. Once the acceptance rate stabilizes, LAPS-SD estimates the request's remaining execution time from the tracked acceptance rate and schedules it accordingly. Integrated into existing speculative decoding systems, LAPS-SD reduces average inference latency by approximately 39% compared with state-of-the-art schedulers.

πŸ“ Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge for efficient request scheduling in these systems. Existing work estimates execution time based solely on predicted output length, which can be inaccurate because execution time depends on both output length and the token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When the token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
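The "least-attained-service" phase described in the abstract resembles a classic multi-level feedback queue: new requests start at the highest priority and are demoted as they accumulate service, so short requests finish quickly without knowing lengths in advance. The sketch below illustrates this idea under stated assumptions; the queue thresholds, `Request` fields, and method names are illustrative, not the paper's actual implementation.

```python
from collections import deque

# Attained-service limits (in decode steps) per queue level; a request
# whose attained service exceeds a threshold drops to the next queue.
# These values are assumptions for illustration.
THRESHOLDS = [4, 16, 64]

class Request:
    def __init__(self, rid):
        self.rid = rid
        self.attained = 0  # speculative decode+verify steps received so far

class LASScheduler:
    """Least-attained-service scheduler with multiple priority queues,
    in the spirit of LAPS-SD's dynamic phase (a simplified sketch)."""

    def __init__(self):
        # One FIFO per priority level; lower index = higher priority.
        self.queues = [deque() for _ in range(len(THRESHOLDS) + 1)]

    def submit(self, req):
        self.queues[0].append(req)  # new arrivals get highest priority

    def level(self, req):
        for i, limit in enumerate(THRESHOLDS):
            if req.attained < limit:
                return i
        return len(THRESHOLDS)

    def pick(self):
        # Serve the highest-priority non-empty queue; a newly arrived
        # request in a higher queue effectively preempts lower queues.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

    def step(self, req, done=False):
        req.attained += 1  # account one decode+verify step of service
        if not done:
            # Re-enqueue at the level matching its attained service,
            # demoting long-running requests over time.
            self.queues[self.level(req)].append(req)
```

Because priority depends only on service already received, the scheduler needs no output-length prediction while acceptance rates are still unstable.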
Problem

Research questions and friction points this paper is trying to address.

Efficiently scheduling LLM inference requests with uncertain execution times
Minimizing latency in speculative decoding systems
Adapting to dynamic token acceptance rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses semi-clairvoyant scheduling for speculative decoding
Adapts to dynamic token acceptance rates
Preempts requests across priority queues
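Once a request's acceptance rate stabilizes, its remaining execution time can be estimated. A standard result from the speculative decoding literature gives the expected number of tokens emitted per draft-then-verify step, assuming each drafted token is accepted independently with probability α and the SSM drafts γ tokens, as (1 − α^(γ+1)) / (1 − α). The helper names and the EMA smoothing factor below are illustrative assumptions, not the paper's formulas.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per verification step when drafted
    tokens are accepted i.i.d. with probability alpha and the SSM
    drafts gamma tokens (standard speculative decoding result)."""
    if alpha >= 1.0:
        return gamma + 1  # every draft accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def update_acceptance_rate(ema, accepted, drafted, beta=0.8):
    """Track the acceptance rate with an exponential moving average;
    beta is an assumed smoothing factor."""
    return beta * ema + (1 - beta) * (accepted / drafted)

def estimate_remaining_steps(remaining_tokens, alpha, gamma):
    """Remaining verification steps implied by the tracked rate."""
    return remaining_tokens / expected_tokens_per_step(alpha, gamma)
```

For example, with α = 0.5 and γ = 3, each step yields 1.875 tokens in expectation, so a request with 150 tokens left needs about 80 more steps; such estimates let a scheduler order stabilized requests by expected remaining work.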
Ruixiao Li
School of Cyber Science and Engineering, Xi'an Jiaotong University
Fahao Chen
The University of Aizu
Cloud computing, machine learning
Peng Li
School of Cyber Science and Engineering, Xi'an Jiaotong University