FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency in current LLM service clusters, which provision GPUs based on the maximum context length, leading to substantial waste of KV-cache resources for short requests. The authors propose a dual-pool architecture that determines cost-optimal resource allocation by leveraging the workload’s prompt-length distribution and a P99 first-token latency target, using an M/G/c queueing model combined with an offline analytical planner. Central to their approach is the novel “Compress-and-Route” mechanism, which softens hardware boundaries into tunable software parameters. Evaluated across three production traces, the method reduces GPU costs by 6%–82% compared to homogeneous clusters, with Compress-and-Route alone contributing 1–44 percentage points of savings. Discrete-event simulations further show that GPU-utilization prediction errors remain within 3%.
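
The waste described above can be made concrete with a few lines of arithmetic. The sketch below uses hypothetical model dimensions (roughly 7B-class: 32 layers, 32 KV heads, head dim 128, fp16); none of these numbers come from the paper.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """Per-request KV-cache footprint; the leading 2 covers the K and V
    tensors at each layer. Dimensions are illustrative, not the paper's."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

reserved = kv_cache_bytes(32_768)  # slot provisioned for the max context
used = kv_cache_bytes(1_024)       # a typical short request
print(reserved / 2**30, "GiB reserved;", used / 2**30, "GiB used")
print("slot utilization:", used / reserved)  # 1024/32768 = 0.03125
```

Under these assumed dimensions, a GPU slot sized for a 32K context reserves 16 GiB of KV cache, of which a 1K-token request touches only 0.5 GiB, i.e. about 3% utilization, which is the over-provisioning the dual-pool design attacks.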

📝 Abstract
Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots. We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice. The analytical core models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture -- a short-context pool and a long-context pool -- with an optimal boundary B* satisfying an equal marginal GPU cost condition across both pools. The fundamental barrier to achieving B* is the cost cliff: a hard routing step where requests just above B* consume 8x--42x more GPU capacity than requests just below it (depending on the context window ratio), creating a structural disincentive to lower the boundary. Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B* before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n_s*, n_l*, B*, gamma*) in under 1 ms. On three production traces, the combined framework reduces total GPU cost by 6--82% versus a homogeneous fleet, with C&R contributing 1--44 percentage points beyond plain pool routing depending on workload archetype. The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with <= 3% error on predicted GPU utilization across all pools and workloads.
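
A toy version of such an offline planner can be sketched as follows: sweep candidate boundaries B over an empirical prompt-length sample, size each pool with an Erlang-C (M/M/c) wait-probability bound as a simplified stand-in for the paper's M/G/c model with a P99 TTFT target, and pick the cheapest (B, n_s, n_l). All arrival rates, service rates, and per-GPU costs below are invented placeholders, and the compression knob gamma is omitted.

```python
import math

def erlang_c(c, a):
    """P(wait) for an M/M/c queue with offered load a = lambda/mu erlangs."""
    if a >= c:
        return 1.0  # unstable: every request waits
    s = sum(a**k / math.factorial(k) for k in range(c))
    top = a**c / math.factorial(c) * c / (c - a)
    return top / (s + top)

def min_servers(lam, mu, wait_prob):
    """Smallest c with P(wait) <= wait_prob (a crude proxy for a P99 TTFT SLO)."""
    c = max(1, math.ceil(lam / mu))
    while erlang_c(c, lam / mu) > wait_prob:
        c += 1
    return c

def plan_fleet(prompt_lengths, lam_total, boundaries, slo=0.01,
               mu_short=2.0, mu_long=0.5, cost_short=1.0, cost_long=4.0):
    """Toy two-pool planner: choose the boundary B minimizing total GPU cost.
    Service rates and costs are illustrative, not FleetOpt's parameters."""
    n = len(prompt_lengths)
    best = None
    for B in boundaries:
        frac_short = sum(1 for p in prompt_lengths if p <= B) / n
        lam_s, lam_l = lam_total * frac_short, lam_total * (1 - frac_short)
        n_s = min_servers(lam_s, mu_short, slo) if lam_s > 0 else 0
        n_l = min_servers(lam_l, mu_long, slo) if lam_l > 0 else 0
        cost = n_s * cost_short + n_l * cost_long
        if best is None or cost < best[0]:
            best = (cost, B, n_s, n_l)
    return best

# Example: 90% short prompts, 10% long, 10 req/s total
lengths = [100] * 90 + [8000] * 10
print(plan_fleet(lengths, lam_total=10.0, boundaries=[512, 2048, 4096]))
```

The real planner differs in the queueing model (M/G/c), the latency metric (P99 TTFT rather than wait probability), and in jointly optimizing the compression fraction gamma, but the structure (route by B, size each pool against the SLO, compare total cost) is the same.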
Problem

Research questions and friction points this paper is trying to address.

LLM inference
GPU fleet provisioning
KV-cache waste
context length
resource over-provisioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

FleetOpt
Compress-and-Route
LLM inference
cost-optimal fleet provisioning
KV-cache optimization