🤖 AI Summary
Existing evaluations of routing mechanisms for collaborative large language models (LLMs) are unsystematic and often overlook scenario alignment and out-of-distribution robustness. To address this gap, this work proposes RouterXBench, a comprehensive evaluation framework that assesses routing strategies along three dimensions: routing capability, scenario alignment, and cross-domain robustness. It further introduces ProbeDirichlet, a lightweight router that models uncertainty directly from internal LLM hidden states, aggregating multi-layer representations through a learnable Dirichlet distribution and thereby eliminating the need for external embeddings or output probabilities. Experimental results show that ProbeDirichlet achieves relative improvements of 16.68% over the best baseline in general routing ability and 18.86% in high-accuracy scenarios, while maintaining consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
📝 Abstract
Large language models (LLMs) have achieved remarkable success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states, which capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
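To make the routing mechanism concrete, the following NumPy sketch illustrates the general idea of a Dirichlet-based probe over hidden states: per-layer linear probes turn hidden states into non-negative evidence, a learnable mixture aggregates the evidence across layers into Dirichlet concentration parameters, and the resulting expected correctness and uncertainty drive the route decision. All shapes, probe weights, and thresholds here are hypothetical placeholders, not the paper's actual implementation or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 layers of hidden states, each 16-dimensional.
n_layers, hidden_dim = 4, 16
hidden_states = rng.normal(size=(n_layers, hidden_dim))

# One linear probe per layer maps a hidden state to evidence for two
# outcomes: the small local model answers correctly vs. incorrectly.
probe_weights = rng.normal(size=(n_layers, hidden_dim, 2)) * 0.1

# Learnable per-layer mixing weights (uniform here for illustration).
layer_logits = np.zeros(n_layers)
layer_mix = np.exp(layer_logits) / np.exp(layer_logits).sum()

# Non-negative evidence via softplus, then aggregate across layers into
# Dirichlet concentrations: alpha = 1 + sum_l w_l * softplus(h_l @ W_l).
evidence = np.log1p(np.exp(np.einsum("ld,ldk->lk", hidden_states, probe_weights)))
alpha = 1.0 + (layer_mix[:, None] * evidence).sum(axis=0)

# Expected probability that the small model is correct, plus the
# Dirichlet's vacuity-style uncertainty (K / sum(alpha), K = 2 classes).
p_correct = alpha[0] / alpha.sum()
uncertainty = 2.0 / alpha.sum()

# Route to the large cloud model when confidence is low (thresholds assumed).
route_to_large = bool(p_correct < 0.5 or uncertainty > 0.8)
```

In an actual router the probe and mixing weights would be trained with a probabilistic objective on multi-domain correctness labels; the sketch only shows how cross-layer evidence can be collapsed into a single Dirichlet whose mean and spread jointly inform the routing decision.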