Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reliability and efficiency challenges in large language model serving, such as excessive KV-cache allocation, low resource utilization, out-of-memory errors, and preemptions, caused by mismatches between request lengths and system configurations. The authors propose a dual-pool token-budget routing mechanism that partitions a homogeneous cluster into a high-throughput pool for short-context requests and a high-capacity pool for long-context requests. They introduce an online bytes-to-token ratio estimator that operates without a tokenizer, leveraging exponential moving averages and a parsing-cost-aware model to enable lightweight, PagedAttention-compatible scheduling. Evaluated on Llama-3-70B and Qwen3-235B, the approach reduces GPU-hour consumption by 31-42% (annual cost savings of $2.86M for Llama-3-70B on A100s, and a projected $15.4M for Qwen3-235B on AMD MI300X), decreases preemption rates by 5.4×, and improves P99 time-to-first-token latency by 6%.
📝 Abstract
Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
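The dispatch mechanism described in the abstract can be sketched concretely. The code below is an illustrative reconstruction, not the authors' implementation: the class names (`BytesToTokenEstimator`, `DualPoolRouter`), the 4096-token threshold, the α=0.1 smoothing factor, and the toy `fleet_savings` formula are all assumptions for exposition. Only the overall shape, an EMA over `usage.prompt_tokens` feedback feeding an O(1) token-budget routing decision, follows the paper's description.

```python
class BytesToTokenEstimator:
    """Online bytes-to-token ratio learned via exponential moving average.

    Per the abstract, the ratio is updated from the `usage.prompt_tokens`
    field the engine returns with each response, so the dispatch path
    never needs to run a tokenizer.
    """

    def __init__(self, initial_ratio=4.0, alpha=0.1):
        self.ratio = initial_ratio  # bytes per token (~4 is a common English prior)
        self.alpha = alpha          # EMA smoothing factor (assumed value)

    def update(self, prompt_bytes, prompt_tokens):
        # Feedback step: observed ratio from the engine's usage report.
        observed = prompt_bytes / max(prompt_tokens, 1)
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed

    def estimate_tokens(self, prompt_bytes):
        return prompt_bytes / self.ratio


class DualPoolRouter:
    """O(1) dispatch: route each request by its estimated total token budget."""

    def __init__(self, estimator, threshold_tokens=4096, default_max_output=512):
        self.estimator = estimator
        self.threshold = threshold_tokens        # assumed pool boundary
        self.default_max_output = default_max_output

    def route(self, prompt_bytes, max_tokens=None):
        # Total budget = estimated prompt tokens + requested (or default) output.
        out = max_tokens if max_tokens is not None else self.default_max_output
        budget = self.estimator.estimate_tokens(prompt_bytes) + out
        return "short_pool" if budget <= self.threshold else "long_pool"


def fleet_savings(short_frac, speedup):
    """Toy analytical savings model (assumed form, not the paper's exact one).

    Baseline provisions every instance for long contexts; the dual-pool fleet
    serves the short fraction `speedup`x faster, so relative GPU-hours are
    short_frac / speedup + (1 - short_frac).
    """
    return 1.0 - (short_frac / speedup + (1.0 - short_frac))
```

A typical loop would call `router.route(...)` on arrival, then feed the engine's reported `usage.prompt_tokens` back into `estimator.update(...)` once the response returns, which is what makes the ratio adapt online to heterogeneous workloads.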
Problem

Research questions and friction points this paper is trying to address.

LLM serving
KV-cache over-allocation
configuration-traffic mismatch
cost-efficiency
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-budget routing
dual-pool scheduling
KV-cache optimization
LLM serving efficiency
online token estimation
Xunzhuo Liu (vLLM Semantic Router Project)
Bowei He (City University of Hong Kong, MBZUAI) · Data Mining, Language Model, GenAI4Science, Agentic AI
Xue Liu (vLLM Semantic Router Project, MBZUAI, McGill University, Mila)
Andy Luo (Unknown affiliation)
Haichen Zhang (AMD)
Huamin Chen (vLLM Semantic Router Project)