Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

📅 2026-03-13

📈 Citations: 2

✨ Influential: 1

🤖 AI Summary

This work addresses the mismatch between request-length distributions and static resource allocation in large language model inference, which wastes concurrency on short requests and causes KV-cache failures on long ones. The authors propose a routing mechanism with O(1) dispatch overhead: a tokenizer-free, self-calibrating estimator predicts each request's total token budget online, and the request is then sent to one of two vLLM pools sized for short or long contexts. The estimator learns per-category bytes-per-token ratios through an exponential moving average, and a separate closed-form cost model predicts fleet-level GPU savings. The approach composes with PagedAttention and continuous batching, and its savings are verified with a discrete-event simulator. On Llama-3-70B it reduces GPU instances by 17–39%, and a projected case study for Qwen3-235B-A22B estimates savings of $15.4 million per year.
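The estimate-then-route loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `TokenBudgetRouter`, the 8192-token short-pool limit, the EMA weight of 0.1, the starting guess of 4 bytes per token, and the choice to add the requested output length to the prompt estimate are all assumptions made for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Hypothetical sketch of tokenizer-free, self-calibrating pool routing."""

    short_pool_max_tokens: int = 8192     # assumed context limit of the short pool
    ema_weight: float = 0.1               # assumed EMA smoothing factor
    default_bytes_per_token: float = 4.0  # assumed starting ratio per category
    ratios: dict[str, float] = field(default_factory=dict)

    def estimate_budget(self, prompt: str, category: str, max_output_tokens: int) -> int:
        """Estimate prompt tokens from byte length, then add the output allowance."""
        ratio = self.ratios.get(category, self.default_bytes_per_token)
        prompt_bytes = len(prompt.encode("utf-8"))
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, prompt: str, category: str, max_output_tokens: int) -> str:
        """O(1) dispatch: one dictionary lookup, one division, one comparison."""
        budget = self.estimate_budget(prompt, category, max_output_tokens)
        return "short" if budget <= self.short_pool_max_tokens else "long"

    def observe(self, prompt: str, category: str, prompt_tokens: int) -> None:
        """Update the category ratio from the server-reported usage.prompt_tokens."""
        if prompt_tokens <= 0:
            return  # nothing to learn from an empty or malformed usage report
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.default_bytes_per_token)
        self.ratios[category] = (1 - self.ema_weight) * previous + self.ema_weight * observed


router = TokenBudgetRouter()
print(router.route("hello", "chat", 256))        # small budget, short pool
print(router.route("x" * 40000, "chat", 256))    # large budget, long pool
router.observe("x" * 300, "code", prompt_tokens=100)
print(router.ratios["code"])
```

Because the ratio is updated only from feedback the server already returns, the dispatcher never needs to load a tokenizer; a category whose content is denser than the starting guess simply drifts toward its own ratio over time.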

📝 Abstract

Production vLLM fleets provision every instance for worst-case context length, wasting 4-8x concurrency on the 80-95% of requests that are short and simultaneously triggering KV-cache failures -- OOM crashes, preemption storms, and request rejections. Both problems share a single root cause: configuration-traffic mismatch. We propose token-budget-aware pool routing: estimate each request's total token budget using a self-calibrating per-category bytes-per-token ratio, then dispatch it to one of two vLLM pools -- a high-throughput short pool or a high-capacity long pool -- each right-sized for its workload class. The ratio is learned online via exponential moving average from usage.prompt_tokens feedback, requiring no tokenizer. A closed-form cost model, savings = alpha * (1 - 1/rho), predicts fleet-level GPU savings from two observable quantities: the short-traffic fraction alpha and the throughput gain ratio rho. On traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M serving Llama-3-70B on A100 GPUs, token-budget routing reduces GPU instances by 17-39% ($1.2-2.0M/yr at 1,000 req/s), with savings verified by a self-contained discrete-event simulator. A case study projecting Qwen3-235B-A22B on AMD MI300X at 10,000 req/s shows $15.4M/yr in savings. The algorithm adds O(1) dispatch overhead, self-calibrates across content types without a tokenizer, and composes with PagedAttention, continuous batching, and prefill-decode disaggregation.
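The closed-form model savings = alpha * (1 - 1/rho) can be read as follows, under the assumption that the long pool runs at baseline throughput: if a homogeneous fleet needs one unit of GPU capacity, the routed fleet needs alpha/rho for short traffic plus (1 - alpha) for long traffic, so the saved fraction is 1 - (alpha/rho + 1 - alpha) = alpha * (1 - 1/rho). The short sketch below evaluates it; the function name `fleet_savings` and the input values are illustrative and are not taken from the paper's measurements.

```python
def fleet_savings(alpha: float, rho: float) -> float:
    """Fraction of GPU instances saved by two-pool routing.

    alpha: fraction of traffic routed to the short pool (0..1).
    rho:   throughput gain of the short pool over the baseline configuration.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return alpha * (1.0 - 1.0 / rho)


# Illustrative inputs only: 80% short traffic, short pool twice as fast.
print(fleet_savings(alpha=0.8, rho=2.0))  # 0.4

# No gain from the short pool means no savings, whatever the traffic mix.
print(fleet_savings(alpha=0.8, rho=1.0))  # 0.0
```

The formula also makes the limits visible: savings can never exceed alpha, since even an infinitely fast short pool leaves the long-traffic share of the fleet untouched.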

Problem

Research questions and friction points this paper is trying to address.

LLM inference

resource provisioning

KV-cache failures

configuration-traffic mismatch

cost efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

token-budget-aware routing

cost-efficient LLM inference

pool-based dispatch