HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

πŸ“… 2025-11-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the prohibitively high inference cost of large language models (LLMs) in resource-constrained settings, this paper proposes HierRouter, a dynamic, multi-model collaborative inference framework based on hierarchical routing. Its core innovation lies in formulating hierarchical model scheduling as a finite-horizon Markov decision process and training a Proximal Policy Optimization (PPO)-based reinforcement learning agent that performs context-aware, cost-sensitive, multi-hop model orchestration. By jointly optimizing response quality and computational overhead, HierRouter achieves up to a 2.4× improvement in response quality over individual models used independently across six benchmark tasks, while incurring only minimal additional inference cost. The implementation is publicly available.

πŸ“ Abstract
Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code can be found at https://github.com/Nikunj-Gupta/hierouter.
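The routing loop described above can be sketched as a finite-horizon episode: the state is the evolving context plus accumulated cost, and at each hop the policy either invokes one of the candidate models or terminates the pipeline. The sketch below is a minimal, hedged illustration of that MDP structure; the candidate names, costs, and the heuristic stand-in for the learned PPO policy are all assumptions, not the paper's actual models or agent.

```python
# Hypothetical candidate pool: (name, per-call cost). Names and costs are
# illustrative placeholders, not the models evaluated in the paper.
CANDIDATES = [("small-qa", 1.0), ("small-code", 1.2), ("small-math", 1.1)]
STOP = len(CANDIDATES)  # extra action: terminate the pipeline
MAX_HOPS = 3            # finite horizon of the MDP


def policy(state):
    """Stand-in for the PPO-trained agent: maps state -> action index.

    A real agent would condition on the evolving context through a learned
    network; here we use a toy cost-budget heuristic for illustration."""
    context_len, cost = state
    if cost >= 2.0:          # budget nearly spent: stop the pipeline
        return STOP
    # Otherwise invoke the cheapest candidate.
    return min(range(len(CANDIDATES)), key=lambda i: CANDIDATES[i][1])


def route(query):
    """Roll out one multi-hop inference episode for a query."""
    context, cost = query, 0.0
    for _ in range(MAX_HOPS):
        action = policy((len(context), cost))
        if action == STOP:
            break
        name, call_cost = CANDIDATES[action]
        # Placeholder for an actual model call refining the context.
        context = f"{context} | refined by {name}"
        cost += call_cost
    return context, cost


answer, total_cost = route("What is 2+2?")
```

During PPO training, each episode's reward would combine response quality with a penalty on `total_cost`, which is what drives the cost-sensitive routing behavior the summary describes.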
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs of large language models for resource-constrained environments
Dynamically assembling specialized lightweight models through intelligent routing decisions
Improving response quality while minimizing additional inference costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical routing over a pool of specialized, lightweight language models
A PPO-trained reinforcement learning agent that selects which model to invoke at each hop
Joint optimization of response quality and computational cost across multi-hop inference pipelines