🤖 AI Summary
This work addresses query scheduling for online large language model (LLM) services under stringent tail-latency constraints on time-to-first-token (TTFT). We formally prove that prefix-aware scheduling, which leverages RadixAttention's prefix reuse mechanism, is NP-hard, and we identify inherent limitations of first-come-first-serve (FCFS) and longest-prefix-match (LPM) policies in meeting TTFT guarantees. To overcome these limitations, we propose k-LPM, a dynamic scheduling algorithm that employs a radix-tree index for efficient prefix matching while jointly optimizing prefix-reuse gains and request fairness. Crucially, k-LPM incorporates traffic-aware modeling to provide provable theoretical bounds on TTFT. Extensive experiments under real-world LLM traffic demonstrate that k-LPM significantly reduces P99 TTFT compared with state-of-the-art baselines, validating its effectiveness for scalable, low-latency LLM inference under high concurrency.
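To make the radix-tree index concrete, here is a minimal sketch of token-level longest-prefix lookup. This is an illustration only, not the paper's or RadixAttention's implementation: it uses an uncompressed trie (a real radix tree compresses single-child edge chains and attaches cached KV-cache entries to nodes), and the class names `RadixTree`/`RadixNode` are hypothetical.

```python
class RadixNode:
    """A node in a token-level prefix tree; children keyed by token id."""
    def __init__(self):
        self.children = {}


class RadixTree:
    """Toy prefix index: insert cached prompt prefixes, then query how many
    leading tokens of a new prompt are already cached (the reuse gain that
    longest-prefix-match scheduling tries to maximize)."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a cached token sequence, creating nodes as needed."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def longest_prefix_len(self, tokens):
        """Walk down from the root; the depth reached is the length of the
        longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

For example, after inserting the cached prompt `(1, 2, 3, 4)`, a query `(1, 2, 9)` matches a prefix of length 2, so only its last token needs fresh prefill computation under prefix reuse.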
📝 Abstract
The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly on the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. Our analysis establishes the NP-hardness of the scheduling problem with prefix reuse under TTFT constraints, and we propose a novel scheduling algorithm, $k$-LPM, which generalizes existing methods by balancing prefix reuse and fairness in query processing. Theoretical guarantees demonstrate that $k$-LPM achieves improved TTFT performance under realistic traffic patterns captured by a data generative model. Empirical evaluations in a realistic serving setting validate our findings, showing significant reductions in P99 TTFT compared to baseline methods.
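The balance between prefix reuse and fairness described above can be sketched as a simple scheduling loop. This is a hedged illustration under one plausible reading of $k$-LPM (serve $k$ requests chosen by longest prefix match, then one by arrival order, repeating); the paper's exact policy and its radix-tree index may differ. The `Request` class and the flat set `cached_prefixes` (standing in for the radix tree) are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int    # arrival order (lower = earlier)
    tokens: tuple   # prompt token ids


def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_lpm_schedule(queue, cached_prefixes, k):
    """Return requests in service order: k longest-prefix-match picks
    (maximizing cache reuse), then one FCFS pick (bounding the wait of
    the oldest request, hence TTFT), repeating until the queue drains."""
    pending = sorted(queue, key=lambda r: r.arrival)
    order = []
    while pending:
        # Phase 1: up to k picks by best overlap with any cached prefix.
        for _ in range(min(k, len(pending))):
            best = max(pending, key=lambda r: max(
                (shared_prefix_len(r.tokens, p) for p in cached_prefixes),
                default=0))
            pending.remove(best)
            order.append(best)
            cached_prefixes.add(best.tokens)  # its prefix is now cached
        # Phase 2: one FCFS pick for fairness.
        if pending:
            order.append(pending.pop(0))
    return order
```

With $k = 0$ this degenerates to pure FCFS, and with large $k$ to pure LPM, which is the sense in which $k$-LPM generalizes both: an old request with no cache overlap, which pure LPM would starve behind a stream of cache-friendly arrivals, is served after at most $k$ reuse-driven picks.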