🤖 AI Summary
To address the challenge of balancing cost and performance in large-scale deployment of large language models (LLMs), this paper proposes an adaptive multi-response routing framework. Departing from conventional single-response routing, the method dynamically selects models based on query difficulty and generates multiple candidate responses from small, low-cost models; a quality-aware filtering mechanism then selects the optimal response. A tunable quality threshold enables fine-grained trade-offs between accuracy and inference cost. The core innovation lies in tightly coupling multi-response sampling with model routing—reducing expensive LLM invocations while preserving response quality. Experiments on real-world datasets demonstrate that the framework achieves up to a 60% reduction in inference cost compared to baseline methods, with less than a 1% degradation in task performance.
📝 Abstract
Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired trade-off. Prior query routing approaches generate only one response from the selected model; because a single response from a small (inexpensive) model is often not good enough to beat a response from a large (expensive) model, these approaches end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.
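The core mechanism — sample several cheap responses, keep the best, and fall back to the expensive model only when no candidate clears a quality threshold — can be illustrated with a minimal sketch. This is not the authors' implementation: `small_model`, `large_model`, the per-call costs, and the length-based `quality` scorer are all hypothetical stand-ins (a real system would call actual model APIs and use a learned quality estimator).

```python
# Toy sketch of threshold-gated best-of-n routing (illustrative only).
SMALL_COST, LARGE_COST = 1.0, 10.0  # assumed per-call costs

def quality(response: str) -> float:
    """Hypothetical quality score in [0, 1] (stand-in for a reward model)."""
    return min(len(response) / 20.0, 1.0)

def small_model(query: str, seed: int) -> str:
    """Cheap-model stand-in: response quality varies across samples."""
    return query[: 5 + 3 * seed]

def large_model(query: str) -> str:
    """Expensive-model stand-in: consistently high-quality answer."""
    return query * 3

def best_route(query: str, n_small: int, threshold: float):
    """Sample n_small responses from the small model and keep the best;
    invoke the large model only if none clears the quality threshold."""
    candidates = [small_model(query, i) for i in range(n_small)]
    best = max(candidates, key=quality)
    cost = n_small * SMALL_COST
    if quality(best) >= threshold:
        return best, cost
    return large_model(query), cost + LARGE_COST
```

Raising `threshold` trades cost for quality: a strict threshold routes more queries to the large model, while a lenient one accepts more best-of-n small-model responses, which is the tunable trade-off the paper evaluates.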