PARS: Low-Latency LLM Serving via Pairwise Learning-to-Rank

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
First-come-first-served (FCFS) scheduling for large language model (LLM) inference suffers from head-of-line (HOL) blocking, which degrades tail latency and limits throughput for short requests. To address this, the paper proposes PARS, a prompt-aware pairwise learning-to-rank scheduler. PARS approximates shortest-job-first (SJF) scheduling by training with a margin ranking loss to predict the relative ordering of request response lengths. Its core innovation is formulating scheduling as a lightweight, prompt-sensitive pairwise comparison task, which enables cross-model generalization and seamless integration into the vLLM inference engine. Extensive experiments across diverse open-source LLMs and realistic workloads show that PARS significantly reduces tail latency (up to 42%) and improves end-to-end throughput (up to 31%), with particular gains on complex reasoning tasks.
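The pairwise objective described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the scorer, margin value, and labels are assumptions.

```python
def margin_ranking_loss(score_i: float, score_j: float, y: int,
                        margin: float = 1.0) -> float:
    """Pairwise margin ranking loss.

    y = +1 if request i's response is expected to be longer than
    request j's, -1 otherwise (illustrative convention).
    The loss is zero once the scores are ordered consistently with y
    by at least `margin`.
    """
    return max(0.0, -y * (score_i - score_j) + margin)

# Correctly ordered pair: score gap (2.0) exceeds the margin -> loss 0.0
loss_correct = margin_ranking_loss(3.0, 1.0, +1)  # -> 0.0
# Mis-ordered pair: penalized by the gap plus the margin -> loss 3.0
loss_wrong = margin_ranking_loss(1.0, 3.0, +1)    # -> 3.0
```

Training on such pairwise comparisons only needs the predictor to get relative orderings right, which is an easier target than predicting exact response lengths.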

📝 Abstract
Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
Problem

Research questions and friction points this paper is trying to address.

Addresses Head-of-Line blocking in LLM inference scheduling
Improves latency through prompt-aware shortest-job-first approximation
Enables cross-model generalization for efficient task scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

PARS scheduler uses pairwise ranking for task ordering
Integrates prompt-aware scheduling into vLLM system
Predicts response length to reduce latency efficiently
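The ordering the bullets above describe amounts to sorting the waiting queue by predicted response length, shortest first. A minimal sketch, assuming a stand-in length predictor (this is not vLLM's scheduler API):

```python
from typing import Callable, List

def sjf_order(queue: List[str],
              predict_len: Callable[[str], float]) -> List[str]:
    """Approximate shortest-job-first: serve the requests with the
    smallest predicted response length first, so short requests are
    not blocked behind long-running ones (illustrative sketch)."""
    return sorted(queue, key=predict_len)

# Stand-in predictor: pretend prompt length tracks response length.
queue = ["write a long essay on scheduling", "hi", "summarize this"]
ordered = sjf_order(queue, predict_len=len)
# "hi" moves to the front of the queue, avoiding HOL blocking
```

In practice the predictor would be the learned pairwise ranker, and the sort would run over the engine's waiting queue at each scheduling step.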
Authors

Yiheng Tao — University of Illinois Chicago
Yihe Zhang — Research Scientist, University of Louisiana at Lafayette (AI security, social network security)
M. Dearing — University of Illinois Chicago
Xin Wang — University of Illinois Chicago
Yuping Fan — Argonne National Laboratory
Zhiling Lan — Professor of Computer Science, University of Illinois Chicago (cluster scheduling, energy efficiency, AI4Sys, modeling and simulation, resilience)