🤖 AI Summary
Existing LLM routers perform single-turn, one-to-one model selection, which limits their ability to coordinate multiple models on complex tasks. Method: We propose Router-R1, the first reinforcement learning framework that instantiates the router as a reasoning-capable LLM and formalizes multi-model routing and response aggregation as a sequential decision-making process. Contribution/Results: Router-R1 introduces (1) a multi-turn "think-route-aggregate" interaction loop; (2) a lightweight three-part rule-based reward covering format compliance, answer correctness, and computational cost, which enables strong zero-shot generalization to unseen models using only simple model descriptors (e.g., price, latency, example performance); and (3) dynamic context maintenance, rule-guided sparse reward shaping, and PPO-based optimization. Evaluated on seven diverse benchmarks, including general and multi-hop QA, Router-R1 significantly outperforms strong baselines (e.g., single-turn routing), achieving superior accuracy and robustness with lower computation and API-call cost.
📝 Abstract
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (*i.e.*, assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present **Router-R1**, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising a format reward, a final outcome reward, and a novel cost reward, opening a pathway toward optimizing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
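The three-part rule-based reward described above (format, outcome, and cost terms) can be sketched as a single scalar signal for PPO. This is a minimal illustrative sketch; the function name, coefficients, and exact reward forms are assumptions, not the paper's actual implementation:

```python
def router_r1_reward(
    format_ok: bool,      # did the rollout follow the think/route/answer format?
    answer_correct: bool, # does the final answer match the gold answer?
    api_cost: float,      # accumulated cost of routed model invocations
    cost_weight: float = 0.1,  # assumed trade-off coefficient (illustrative)
) -> float:
    """Combine format, outcome, and cost terms into one scalar RL reward.

    Sketch only: Router-R1's actual reward shaping may differ in scale
    and functional form.
    """
    r_format = 1.0 if format_ok else -1.0   # format-compliance reward
    r_outcome = 1.0 if answer_correct else 0.0  # final outcome reward
    r_cost = -cost_weight * api_cost        # penalize expensive routing
    return r_format + r_outcome + r_cost
```

Under this sketch, a well-formatted, correct rollout with zero API cost scores 2.0, while raising `cost_weight` pushes the policy toward cheaper model invocations, which is the performance-cost trade-off the abstract describes.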