inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

📅 2026-03-16
🤖 AI Summary
This work addresses the challenge of GPU cluster sizing for large language model inference, where closed-form solutions are lacking. It presents the first end-to-end capacity planning tool that integrates M/G/c queueing theory, discrete-event simulation, and a physics-informed GPU performance model covering A10G, A100, and H100 architectures. The approach jointly models request queues, routing policies, and multi-pool configurations—including monolithic, dual-pool, and decoupled designs—enabling accurate resource optimization for heavy-tailed workloads without physical hardware. It satisfies P99 time-to-first-token (TTFT) service-level objectives while minimizing cost. Evaluated across seven real-world and synthetic workload scenarios, the method precisely identifies optimal GPU types, pool-splitting thresholds, and system bottlenecks, substantially outperforming conventional simplified analytical approaches.
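The analytical side of the approach can be illustrated with a minimal sketch of capacity planning via an M/G/c waiting-time approximation. This is not the tool's code: the function names and the use of the Allen-Cunneen approximation (Erlang C scaled by the service-time squared coefficient of variation) are illustrative assumptions about how such a planner might size a single pool.

```python
import math

def erlang_c(c, a):
    """Erlang C: probability an arrival must queue, for c servers and
    offered load a = lam / mu (requires a < c)."""
    rho = a / c
    s = sum(a**k / math.factorial(k) for k in range(c))
    top = a**c / math.factorial(c) / (1 - rho)
    return top / (s + top)

def mgc_wait(lam, mu, c, scv_service):
    """Allen-Cunneen approximation for mean queueing delay in M/G/c.
    scv_service is the squared coefficient of variation of service time,
    which is large for heavy-tailed token-length distributions."""
    a = lam / mu
    if a >= c:
        return float("inf")  # unstable: arrival rate exceeds fleet capacity
    w_mmc = erlang_c(c, a) / (c * mu - lam)  # exact M/M/c mean wait
    return w_mmc * (1 + scv_service) / 2     # Poisson arrivals: C_a^2 = 1

def min_gpus(lam, mu, scv, slo_s):
    """Smallest pool size whose approximate mean wait fits the SLO budget."""
    c = max(1, math.ceil(lam / mu))
    while mgc_wait(lam, mu, c, scv) > slo_s:
        c += 1
    return c
```

The heavy-tail sensitivity the summary emphasizes shows up directly here: doubling `scv` doubles the predicted wait, so a mean-only model systematically undersizes the fleet. A planner like the one described would use such an estimate only as a starting point, then validate the tail (P99, not the mean) by simulation.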

📝 Abstract
Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none of them address the upstream question of how many GPUs to buy and how to arrange them. inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware. We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each one surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.
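What "empirically meets a P99 TTFT SLO" means can be sketched with a toy FIFO multi-server discrete-event simulation. Everything here is an illustrative assumption rather than the tool's implementation: requests wait for a free GPU, spend a prefill interval before their first token, then hold the GPU through decode, and the empirical 99th percentile of (queue wait + prefill) is the measured TTFT tail.

```python
import heapq
import random

def simulate_ttft_p99(n_requests, c, arrival_rate, prefill, decode, seed=0):
    """Toy FIFO discrete-event simulation of a c-GPU pool.
    prefill/decode are callables drawing per-request durations (seconds).
    Returns the empirical P99 time-to-first-token (queue wait + prefill)."""
    rng = random.Random(seed)
    free_at = [0.0] * c            # next-available time for each GPU
    heapq.heapify(free_at)
    t, ttfts = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(arrival_rate)      # Poisson arrivals
        start = max(t, heapq.heappop(free_at))  # wait for earliest free GPU
        p, d = prefill(rng), decode(rng)
        heapq.heappush(free_at, start + p + d)  # GPU stays busy through decode
        ttfts.append(start - t + p)             # TTFT = queue wait + prefill
    ttfts.sort()
    return ttfts[int(0.99 * len(ttfts))]
```

Running this with a heavy-tailed decode distribution (e.g. `lambda r: r.lognormvariate(0.0, 1.5)`) makes the abstract's point concrete: near saturation the P99 TTFT explodes long before mean utilization looks alarming, which is why a closed-form mean-wait estimate alone cannot certify a tail SLO.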
Problem

Research questions and friction points this paper is trying to address.

LLM inference
fleet capacity planning
queueing theory
GPU allocation
TTFT SLO
Innovation

Methods, ideas, or system contributions that make the work stand out.

queueing theory
discrete-event simulation
LLM inference
fleet capacity planning
GPU performance modeling