🤖 AI Summary
To address head-of-line (HOL) blocking in first-come-first-served (FCFS) scheduling for large language model (LLM) inference—which degrades tail latency and limits throughput for short requests—this paper proposes PARS, a prompt-aware pairwise learning-to-rank scheduler. PARS approximates shortest-job-first (SJF) scheduling by training with a margin ranking loss to predict the relative ordering of request response lengths. Its core innovation lies in formulating scheduling as a lightweight, prompt-sensitive pairwise comparison task, enabling cross-model generalization and seamless integration into the vLLM inference engine. Extensive experiments across diverse open-source LLMs and realistic workloads demonstrate that PARS significantly reduces tail latency and improves end-to-end throughput, particularly enhancing service efficiency for complex reasoning tasks.
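The pairwise training signal described above can be made concrete with a small sketch. The margin ranking loss penalizes the model whenever the predicted score of the request that should rank first does not exceed the other's score by at least a fixed margin; the function below is a minimal pure-Python version (the signature and margin value are illustrative, not taken from the paper).

```python
def margin_ranking_loss(score_a, score_b, target, margin=1.0):
    """Pairwise margin ranking loss for one pair of requests.

    target = +1 if request A should rank ahead of request B
    (e.g. A's true response is shorter), -1 otherwise.
    Loss is zero once the correctly ordered score gap exceeds `margin`.
    """
    return max(0.0, -target * (score_a - score_b) + margin)

# A should rank first (target=+1): a large positive gap gives zero loss,
# an inverted gap is penalized in proportion to how wrong it is.
loss_good = margin_ranking_loss(2.0, 0.0, target=1)   # 0.0
loss_bad = margin_ranking_loss(0.0, 2.0, target=1)    # 3.0
```

During training, such losses would be summed over sampled pairs of prompts whose true response lengths are known, so the scorer only needs to get relative order right rather than predict exact lengths.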
📝 Abstract
Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.
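At serving time, a learned pairwise ranker can drive SJF-style ordering directly: the queue is sorted using the ranker as a comparator, so requests predicted to finish sooner run first. The sketch below assumes a hypothetical `shorter_first(a, b)` predictor standing in for the trained model (here a prompt-length heuristic, purely for illustration); the actual PARS integration with vLLM's scheduler is more involved.

```python
from functools import cmp_to_key

def order_queue(requests, shorter_first):
    """Approximate SJF by sorting queued requests with a pairwise predictor.

    `shorter_first(a, b)` returns a negative value if `a` is expected to
    produce a shorter response than `b`, positive otherwise. This is a
    stand-in for the learned pairwise ranker.
    """
    return sorted(requests, key=cmp_to_key(shorter_first))

# Toy stand-in predictor: use prompt length as a proxy for response length.
prompts = [
    "explain, step by step, how transformers compute attention",
    "hi",
    "summarize this paragraph",
]
ordered = order_queue(prompts, lambda a, b: len(a) - len(b))
# The short greeting is scheduled first, avoiding HOL blocking behind
# the long reasoning prompt.
```

Because the predictor only reorders the queue, mispredictions degrade gracefully toward FCFS behavior rather than causing correctness issues.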