Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reliability and efficiency challenges in large language model serving, such as excessive KV-cache allocation, low resource utilization, out-of-memory errors, and preemptions, caused by mismatches between request lengths and system configurations. The authors propose a dual-pool token-budget routing mechanism that partitions a homogeneous cluster into a high-throughput pool for short-context requests and a high-capacity pool for long-context requests. They introduce an online bytes-to-token ratio estimator that operates without a tokenizer, leveraging exponential moving averages and a parsing-cost-aware model to enable lightweight, PagedAttention-compatible scheduling. Evaluated on Llama-3-70B and Qwen3-235B, the approach reduces GPU-hour consumption by 31-42% (annual cost savings of $2.86M for Llama-3-70B on A100s, and a projected $15.4M for Qwen3-235B on AMD MI300X), decreases preemption rates by 5.4×, and improves P99 time-to-first-token latency by 6%.
📝 Abstract
Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
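The dispatch mechanism described in the abstract can be sketched concretely. The code below is an illustrative reconstruction, not the authors' implementation: the class names (`BytesToTokenEstimator`, `DualPoolRouter`), the 4096-token threshold, the α=0.1 smoothing factor, and the toy `fleet_savings` formula are all assumptions for exposition. Only the overall shape, an EMA over `usage.prompt_tokens` feedback feeding an O(1) token-budget routing decision, follows the paper's description.

```python
class BytesToTokenEstimator:
    """Online bytes-to-token ratio learned via exponential moving average.

    Per the abstract, the ratio is updated from the `usage.prompt_tokens`
    field the engine returns with each response, so the dispatch path
    never needs to run a tokenizer.
    """

    def __init__(self, initial_ratio=4.0, alpha=0.1):
        self.ratio = initial_ratio  # bytes per token (~4 is a common English prior)
        self.alpha = alpha          # EMA smoothing factor (assumed value)

    def update(self, prompt_bytes, prompt_tokens):
        # Feedback step: observed ratio from the engine's usage report.
        observed = prompt_bytes / max(prompt_tokens, 1)
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed

    def estimate_tokens(self, prompt_bytes):
        return prompt_bytes / self.ratio


class DualPoolRouter:
    """O(1) dispatch: route each request by its estimated total token budget."""

    def __init__(self, estimator, threshold_tokens=4096, default_max_output=512):
        self.estimator = estimator
        self.threshold = threshold_tokens        # assumed pool boundary
        self.default_max_output = default_max_output

    def route(self, prompt_bytes, max_tokens=None):
        # Total budget = estimated prompt tokens + requested (or default) output.
        out = max_tokens if max_tokens is not None else self.default_max_output
        budget = self.estimator.estimate_tokens(prompt_bytes) + out
        return "short_pool" if budget <= self.threshold else "long_pool"


def fleet_savings(short_frac, speedup):
    """Toy analytical savings model (assumed form, not the paper's exact one).

    Baseline provisions every instance for long contexts; the dual-pool fleet
    serves the short fraction `speedup`x faster, so relative GPU-hours are
    short_frac / speedup + (1 - short_frac).
    """
    return 1.0 - (short_frac / speedup + (1.0 - short_frac))
```

A typical loop would call `router.route(...)` on arrival, then feed the engine's reported `usage.prompt_tokens` back into `estimator.update(...)` once the response returns, which is what makes the ratio adapt online to heterogeneous workloads.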
Problem

Research questions and friction points this paper is trying to address.

LLM serving
KV-cache over-allocation
configuration-traffic mismatch
cost-efficiency
reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-budget routing
dual-pool scheduling
KV-cache optimization
LLM serving efficiency
online token estimation
Xunzhuo Liu (vLLM Semantic Router Project)
Bowei He (City University of Hong Kong, MBZUAI) · Data Mining, Language Model, GenAI4Science, Agentic AI
Xue Liu (vLLM Semantic Router Project, MBZUAI, McGill University, Mila)
Andy Luo (Unknown affiliation)
Haichen Zhang (AMD)
Huamin Chen (vLLM Semantic Router Project)