🤖 AI Summary
To address head-of-line (HOL) blocking in first-come-first-served (FCFS) scheduling for large language model (LLM) inference—which degrades tail latency and limits throughput for short requests—this paper proposes PARS, a prompt-aware pairwise learning-to-rank scheduler. PARS approximates shortest-job-first (SJF) scheduling by training with a margin ranking loss to predict the relative ordering of request response lengths. Its core innovation lies in formulating scheduling as a lightweight, prompt-sensitive pairwise comparison task, enabling cross-model generalization and seamless integration into the vLLM inference engine. Extensive experiments across diverse open-source LLMs and realistic workloads demonstrate that PARS significantly reduces tail latency and improves end-to-end throughput, particularly enhancing service efficiency for complex reasoning tasks.
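The pairwise training signal described above can be made concrete with a small sketch. The margin ranking loss penalizes the model whenever the predicted score of the request that should rank first does not exceed the other's score by at least a fixed margin; the function below is a minimal pure-Python version (the signature and margin value are illustrative, not taken from the paper).

```python
def margin_ranking_loss(score_a, score_b, target, margin=1.0):
    """Pairwise margin ranking loss for one pair of requests.

    target = +1 if request A should rank ahead of request B
    (e.g. A's true response is shorter), -1 otherwise.
    Loss is zero once the correctly ordered score gap exceeds `margin`.
    """
    return max(0.0, -target * (score_a - score_b) + margin)

# A should rank first (target=+1): a large positive gap gives zero loss,
# an inverted gap is penalized in proportion to how wrong it is.
loss_good = margin_ranking_loss(2.0, 0.0, target=1)   # 0.0
loss_bad = margin_ranking_loss(0.0, 2.0, target=1)    # 3.0
```

During training, such losses would be summed over sampled pairs of prompts whose true response lengths are known, so the scorer only needs to get relative order right rather than predict exact lengths.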
📝 Abstract
Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
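At serving time, a learned pairwise ranker can drive SJF-style ordering directly: the queue is sorted using the ranker as a comparator, so requests predicted to finish sooner run first. The sketch below assumes a hypothetical `shorter_first(a, b)` predictor standing in for the trained model (here a prompt-length heuristic, purely for illustration); the actual PARS integration with vLLM's scheduler is more involved.

```python
from functools import cmp_to_key

def order_queue(requests, shorter_first):
    """Approximate SJF by sorting queued requests with a pairwise predictor.

    `shorter_first(a, b)` returns a negative value if `a` is expected to
    produce a shorter response than `b`, positive otherwise. This is a
    stand-in for the learned pairwise ranker.
    """
    return sorted(requests, key=cmp_to_key(shorter_first))

# Toy stand-in predictor: use prompt length as a proxy for response length.
prompts = [
    "explain, step by step, how transformers compute attention",
    "hi",
    "summarize this paragraph",
]
ordered = order_queue(prompts, lambda a, b: len(a) - len(b))
# The short greeting is scheduled first, avoiding HOL blocking behind
# the long reasoning prompt.
```

Because the predictor only reorders the queue, mispredictions degrade gracefully toward FCFS behavior rather than causing correctness issues.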