🤖 AI Summary
Existing evaluations of routing mechanisms for collaborative large language models (LLMs) are unsystematic and often overlook scenario alignment and out-of-distribution robustness. To address this gap, this work proposes RouterXBench, a comprehensive evaluation framework that assesses routing strategies along three dimensions: routing capability, scenario alignment, and cross-domain robustness. It further introduces ProbeDirichlet, a lightweight router that models uncertainty directly from internal LLM hidden states, aggregating multi-layer representations through a learnable Dirichlet distribution and thereby eliminating the need for external embeddings or output probabilities. Experimental results show that ProbeDirichlet achieves relative improvements of 16.68% over the best baseline in general routing ability and 18.86% in high-accuracy scenarios, while maintaining consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
📝 Abstract
Large language models (LLMs) have achieved remarkable success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states, which capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
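To make the routing mechanism concrete, the following NumPy sketch illustrates the general idea of a Dirichlet-based probe over hidden states: per-layer linear probes turn hidden states into non-negative evidence, a learnable mixture aggregates the evidence across layers into Dirichlet concentration parameters, and the resulting expected correctness and uncertainty drive the route decision. All shapes, probe weights, and thresholds here are hypothetical placeholders, not the paper's actual implementation or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 layers of hidden states, each 16-dimensional.
n_layers, hidden_dim = 4, 16
hidden_states = rng.normal(size=(n_layers, hidden_dim))

# One linear probe per layer maps a hidden state to evidence for two
# outcomes: the small local model answers correctly vs. incorrectly.
probe_weights = rng.normal(size=(n_layers, hidden_dim, 2)) * 0.1

# Learnable per-layer mixing weights (uniform here for illustration).
layer_logits = np.zeros(n_layers)
layer_mix = np.exp(layer_logits) / np.exp(layer_logits).sum()

# Non-negative evidence via softplus, then aggregate across layers into
# Dirichlet concentrations: alpha = 1 + sum_l w_l * softplus(h_l @ W_l).
evidence = np.log1p(np.exp(np.einsum("ld,ldk->lk", hidden_states, probe_weights)))
alpha = 1.0 + (layer_mix[:, None] * evidence).sum(axis=0)

# Expected probability that the small model is correct, plus the
# Dirichlet's vacuity-style uncertainty (K / sum(alpha), K = 2 classes).
p_correct = alpha[0] / alpha.sum()
uncertainty = 2.0 / alpha.sum()

# Route to the large cloud model when confidence is low (thresholds assumed).
route_to_large = bool(p_correct < 0.5 or uncertainty > 0.8)
```

In an actual router the probe and mixing weights would be trained with a probabilistic objective on multi-domain correctness labels; the sketch only shows how cross-layer evidence can be collapsed into a single Dirichlet whose mean and spread jointly inform the routing decision.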