🤖 AI Summary
This work addresses the challenge of low-latency scheduling for large language model inference under hard memory constraints, where unknown response lengths and the linear growth of KV cache memory with generation length complicate efficient resource management. The authors propose the Geometric Slicing Algorithm (GSA), which uses a geometric-phase structure to periodically restart tasks, thereby bounding memory exposure, and integrates an interleaved pipelining strategy to smooth aggregate memory consumption. GSA is the first algorithm to provide a constant competitive ratio guarantee for KV cache scheduling in the non-clairvoyant offline batching setting, achieving a ratio of at most 61.92 on general instances and 32 in the large-memory regime. The authors also design a clairvoyant counterpart, the Geometric Batching Algorithm (GBA), which attains approximation ratios of 10.67 (general) and 6.75 (large-memory), offering both strong theoretical guarantees and empirical robustness.
📝 Abstract
Large Language Model (LLM) inference presents a unique scheduling challenge due to the Key-Value (KV) cache, where a job's memory footprint grows linearly with the number of decoded tokens. This growth couples scheduling decisions with feasibility: a scheduler must minimize latency under a hard memory budget, yet the response lengths of requests are inherently unknown. While recent works have explored this problem either assuming clairvoyance -- exact knowledge of response lengths -- or relying on machine-learned predictions, obtaining robust performance guarantees without any prior knowledge of job sizes remains a theoretically fundamental and practically important open problem. In this work, we propose the Geometric Slicing Algorithm (GSA), the first non-clairvoyant policy to achieve a constant competitive ratio for this problem in the offline batch setting. GSA manages uncertainty through a geometric phase structure that periodically restarts jobs to bound memory exposure, combined with a staggered pipeline mechanism that enables high concurrency by smoothing aggregate memory consumption. We prove that GSA achieves a competitive ratio of at most 61.92 for general instances, improving to 32 in the large-memory regime. Our algorithmic framework also yields a clairvoyant counterpart, the Geometric Batching Algorithm (GBA), which achieves an approximation ratio of 10.67 for general instances and 6.75 in the large-memory regime -- significantly improving upon the best previously known bound of over 9000. Numerical experiments on real request traces demonstrate that our algorithms perform robustly while preserving these worst-case guarantees.
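To make the geometric phase idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation; function names, the batching rule, and the doubling ratio are assumptions). In phase k, every unfinished job is run from scratch for at most `base * ratio**k` decoded tokens, so any job's KV footprint during that phase is bounded by the cap; jobs that exceed the cap are evicted and retried in the next phase. Because response lengths are never consulted, the policy is non-clairvoyant.

```python
# Toy simulation of a geometric-phase restart policy, loosely in the spirit
# of the paper's GSA. All names and parameters here are illustrative
# assumptions, not the paper's algorithm or constants.

def geometric_slicing(response_lengths, memory_budget, base=1, ratio=2):
    """Schedule jobs without knowing their lengths in advance.

    In phase k, each unfinished job runs from scratch for at most
    base * ratio**k tokens. Jobs finishing within the cap are done;
    the rest have their KV cache freed and are retried next phase.
    Jobs are packed so the worst-case per-phase KV footprint
    (cap tokens per concurrent job) never exceeds memory_budget.

    Returns (total decoded-token work, number of phases used).
    """
    unfinished = list(range(len(response_lengths)))
    total_work = 0
    phase = 0
    while unfinished:
        cap = base * ratio ** phase
        # Each concurrent job may occupy up to `cap` tokens of KV cache,
        # so a batch holds at most memory_budget // cap jobs.
        batch_size = max(1, memory_budget // cap)
        still_unfinished = []
        for i in range(0, len(unfinished), batch_size):
            for job in unfinished[i:i + batch_size]:
                n = response_lengths[job]
                if n <= cap:
                    total_work += n       # job completes in this phase
                else:
                    total_work += cap     # wasted work; restart later
                    still_unfinished.append(job)
        unfinished = still_unfinished
        phase += 1
    return total_work, phase
```

The doubling cap is what bounds the overhead: work wasted on restarts of a job of length n is dominated by a geometric series and stays within a constant factor of n, which is the intuition behind a constant competitive ratio (the paper's actual analysis, including the staggered pipeline, is considerably more involved).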