LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified, large-scale evaluation benchmark in large language model (LLM) routing research, a gap that has hindered reliable method comparison and led to questionable conclusions. We present the first standardized and reproducible LLM routing evaluation framework, encompassing over 400,000 samples across 21 datasets and 33 models, integrating 10 representative routing baselines, and enabling comprehensive assessment under both performance-driven and cost-performance trade-off settings. Through multi-dataset aggregation, unified metrics, latency-aware analysis, and embedding ablation studies, we uncover key insights: current routing methods exhibit highly convergent performance, commercial routers offer limited advantages, and the primary bottleneck, which limits model complementarity and widens the gap to Oracle performance, is model recall failure.

📝 Abstract
Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models, provides comprehensive metrics for both performance-oriented routing and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity, the central premise of LLM routing, we find that many routing methods exhibit similar performance under unified evaluation, and that several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap to the Oracle remains, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.
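To make the performance-cost trade-off setting concrete, the sketch below shows one minimal, hypothetical form of score-based routing: for each query, pick the model that maximizes predicted quality minus a cost penalty. This is an illustration of the general idea only, not LLMRouterBench's implementation or any of its baselines; the function and variable names are invented for this example. An Oracle router would instead use the models' actual (post-hoc) correctness, which is why a gap to Oracle persists whenever predicted quality misranks models (the "model recall failures" discussed above).

```python
# Hypothetical sketch of performance-cost trade-off routing; not the
# paper's method. Each query is sent to the model maximizing
# predicted_quality - lam * cost, where lam trades performance for cost.

def route(predicted_quality, cost_per_query, lam):
    """Return the index of the model with the best quality-cost score."""
    scores = [q - lam * c for q, c in zip(predicted_quality, cost_per_query)]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: three candidate models (values are illustrative).
quality = [0.62, 0.71, 0.90]   # predicted probability of a correct answer
cost    = [0.1, 0.5, 2.0]      # relative cost per query

print(route(quality, cost, lam=0.0))   # cost ignored: strongest model, index 2
print(route(quality, cost, lam=0.2))   # cost-aware: mid-tier model, index 1
```

Sweeping `lam` from 0 upward traces out a cost-performance curve for the router, which is the kind of trade-off the benchmark's unified metrics evaluate.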
Problem

Research questions and friction points this paper is trying to address.

LLM routing
benchmark
model selection
performance-cost trade-off
Oracle gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM routing
benchmark
model ensemble
performance-cost trade-off
Oracle gap