🤖 AI Summary
This work addresses query scheduling for online large language model (LLM) services under stringent tail-latency constraints on time-to-first-token (TTFT). We formally prove that prefix-aware scheduling, which leverages RadixAttention's prefix reuse mechanism, is NP-hard, and we identify inherent limitations of first-come-first-serve (FCFS) and longest-prefix-match (LPM) policies in meeting TTFT guarantees. To overcome these limitations, we propose k-LPM, a dynamic scheduling algorithm that employs a radix-tree index for efficient prefix matching while jointly optimizing prefix-reuse gains and request fairness. Crucially, k-LPM incorporates traffic-aware modeling to provide provable theoretical bounds on TTFT. Extensive experiments under real-world LLM traffic demonstrate that k-LPM significantly reduces P99 TTFT compared with state-of-the-art baselines, validating its effectiveness for scalable, low-latency LLM inference under high concurrency.
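To make the radix-tree index concrete, here is a minimal sketch of token-level longest-prefix lookup. This is an illustration only, not the paper's or RadixAttention's implementation: it uses an uncompressed trie (a real radix tree compresses single-child edge chains and attaches cached KV-cache entries to nodes), and the class names `RadixTree`/`RadixNode` are hypothetical.

```python
class RadixNode:
    """A node in a token-level prefix tree; children keyed by token id."""
    def __init__(self):
        self.children = {}


class RadixTree:
    """Toy prefix index: insert cached prompt prefixes, then query how many
    leading tokens of a new prompt are already cached (the reuse gain that
    longest-prefix-match scheduling tries to maximize)."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a cached token sequence, creating nodes as needed."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def longest_prefix_len(self, tokens):
        """Walk down from the root; the depth reached is the length of the
        longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n
```

For example, after inserting the cached prompt `(1, 2, 3, 4)`, a query `(1, 2, 9)` matches a prefix of length 2, so only its last token needs fresh prefill computation under prefix reuse.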
📝 Abstract
The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly on the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. Our analysis establishes the NP-hardness of the scheduling problem with prefix reuse under TTFT constraints, and we propose a novel scheduling algorithm, $k$-LPM, which generalizes existing methods by balancing prefix reuse and fairness in query processing. Theoretical guarantees demonstrate that $k$-LPM achieves improved TTFT performance under realistic traffic patterns captured by a data generative model. Empirical evaluations in a realistic serving setting validate our findings, showing significant reductions in P99 TTFT compared to baseline methods.
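The balance between prefix reuse and fairness described above can be sketched as a simple scheduling loop. This is a hedged illustration under one plausible reading of $k$-LPM (serve $k$ requests chosen by longest prefix match, then one by arrival order, repeating); the paper's exact policy and its radix-tree index may differ. The `Request` class and the flat set `cached_prefixes` (standing in for the radix tree) are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int    # arrival order (lower = earlier)
    tokens: tuple   # prompt token ids


def shared_prefix_len(a, b):
    """Length of the common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_lpm_schedule(queue, cached_prefixes, k):
    """Return requests in service order: k longest-prefix-match picks
    (maximizing cache reuse), then one FCFS pick (bounding the wait of
    the oldest request, hence TTFT), repeating until the queue drains."""
    pending = sorted(queue, key=lambda r: r.arrival)
    order = []
    while pending:
        # Phase 1: up to k picks by best overlap with any cached prefix.
        for _ in range(min(k, len(pending))):
            best = max(pending, key=lambda r: max(
                (shared_prefix_len(r.tokens, p) for p in cached_prefixes),
                default=0))
            pending.remove(best)
            order.append(best)
            cached_prefixes.add(best.tokens)  # its prefix is now cached
        # Phase 2: one FCFS pick for fairness.
        if pending:
            order.append(pending.pop(0))
    return order
```

With $k = 0$ this degenerates to pure FCFS, and with large $k$ to pure LPM, which is the sense in which $k$-LPM generalizes both: an old request with no cache overlap, which pure LPM would starve behind a stream of cache-friendly arrivals, is served after at most $k$ reuse-driven picks.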