🤖 AI Summary
To address high tail latency in token generation—specifically elevated tail time-to-first-token (TTFT) and tail time-between-tokens (TBT)—caused by KV cache contention during large language model (LLM) inference, this paper proposes CacheOPT, a system that jointly optimizes response timeliness and serving capacity. Its core contributions are: (1) arrival-rate-adaptive probabilistic output-length prediction with proactive KV cache allocation; (2) global KV cache reservation and reuse of other requests' allocated cache to avoid preemptions while reducing waiting time; and (3) SLO-aware preemption-victim selection (preferring requests with long TBT SLOs, long remaining job time, and short preemption time), coupled with choosing the lower-latency of swapping and recomputation for each preemption. Experiments show that CacheOPT reduces tail TTFT and tail TBT by up to 2.83× and 3.29×, respectively; improves TTFT and TBT SLO attainment by 47% and 53%; and supports up to a 1.58× higher request arrival rate than state-of-the-art methods.
📝 Abstract
In Large Language Model (LLM) serving, the KV-cache (KVC) bottleneck causes high tail Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT), impairing user experience, particularly in time-sensitive applications. However, satisfying both TTFT and TBT service-level objectives (SLOs) is challenging. To address this, we propose CacheOPT, a system for mitigating KV cache competition, built on key insights from our measurements and incorporating several novel components. First, it estimates a request's output length, bounding the deviation with a high specified probability that is adjusted based on the request arrival rate. Second, it allocates the estimated KVC demand to a request and reuses other requests' allocated KVC to avoid preemptions while reducing waiting time. Third, it proactively allocates KVC before, rather than at the time, a request exhausts its allocation, and reserves KVC globally to prevent preemptions. Fourth, it preempts a request that has a long TBT SLO, long remaining job time, and short preemption time. Fifth, it selects the lower-latency strategy between swapping and recomputation for each preemption. Experiments show that CacheOPT achieves up to 3.29× and 2.83× lower tail TBT and tail TTFT, 47% and 53% higher TTFT and TBT SLO attainments, and supports up to a 1.58× higher request arrival rate than state-of-the-art methods.
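The swap-versus-recomputation choice in the fifth point can be illustrated with a minimal cost model. This is a hedged sketch, not the paper's actual formulation: the function name, parameters, and the simple bandwidth/throughput model are illustrative assumptions (swapping is modeled as moving the KV tensors over PCIe out and back; recomputation as re-running prefill over the tokens processed so far).

```python
def choose_preemption_strategy(kv_cache_bytes: float,
                               processed_tokens: int,
                               pcie_bw_bytes_per_s: float,
                               prefill_tokens_per_s: float) -> str:
    """Pick whichever preemption strategy restores the request faster.

    Illustrative model (not from the paper):
      - swap: copy KV cache to host memory and back over PCIe
      - recompute: discard KV cache and re-run prefill on resume
    """
    swap_latency = 2 * kv_cache_bytes / pcie_bw_bytes_per_s
    recompute_latency = processed_tokens / prefill_tokens_per_s
    return "swap" if swap_latency <= recompute_latency else "recompute"
```

Under such a model, a request with a small KV footprint but a long processed prefix favors swapping, while a request with a large cache and a short prefix favors recomputation; CacheOPT picks per preemption rather than fixing one strategy system-wide.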