From Myopic Selection to Long-Horizon Awareness: Sequential LLM Routing for Multi-Turn Dialogue

📅 2026-04-14

🤖 AI Summary
This work addresses a limitation of existing large language model (LLM) routing methods: they typically make single-turn decisions and therefore fail to optimize cumulative performance over multi-turn dialogues. The authors propose DialRouter, which the summary describes as the first framework to incorporate long-horizon sequential decision-making into LLM routing. DialRouter uses Monte Carlo Tree Search (MCTS) to explore high-reward dialogue trajectories, distills the search results into a lightweight router, and uses retrieval-augmented modeling to approximate future states, so that routing at inference time requires no online search. Evaluated on both open-domain and domain-specific tasks, DialRouter significantly outperforms existing routing strategies and individual LLMs in task success rate, and achieves a better performance-cost trade-off when combined with a cost-aware reward.
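
To make the search step concrete, here is a minimal sketch of MCTS over per-turn model choices. It is not the authors' implementation: the candidate pool, the costs, the four-turn horizon, the cost weight, and the success rule in `episode_reward` are all made-up assumptions chosen so that the reward is delayed until the dialogue ends. A real system would replace `episode_reward` with actual LLM calls and a task-success judge.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical candidate pool and per-turn costs (illustrative values only).
MODELS = ("small", "medium", "large")
COST = {"small": 0.1, "medium": 0.3, "large": 1.0}
HORIZON = 4    # number of dialogue turns in the toy task (assumption)
LAMBDA = 0.2   # weight of the cost penalty in the reward (assumption)


def episode_reward(path):
    """Toy delayed reward: success is only known once the dialogue ends.

    Assumed rule: the dialogue succeeds if the first turn uses "large"
    and the last turn does not use "small". Cost is charged on every turn.
    """
    success = 1.0 if path[0] == "large" and path[-1] != "small" else 0.0
    return success - LAMBDA * sum(COST[m] for m in path)


def myopic_route():
    """Per-turn baseline: no success signal is visible mid-dialogue,
    so the best immediate choice is always the cheapest model."""
    cheapest = min(MODELS, key=COST.get)
    return (cheapest,) * HORIZON


@dataclass
class Node:
    path: tuple                 # models chosen so far, one per turn
    visits: int = 0
    total: float = 0.0          # sum of rewards backed up through this node
    children: dict = field(default_factory=dict)


def uct_child(node, c=1.4):
    """Pick the child with the highest UCT score (mean value + exploration bonus)."""
    def score(child):
        mean = child.total / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def mcts_route(iterations=2000, seed=0):
    """Search over model sequences; return the best trajectory seen and its reward."""
    rng = random.Random(seed)
    root = Node(path=())
    best_path, best_reward = None, -math.inf
    for _ in range(iterations):
        node, trail = root, [root]
        # Selection: descend while the node is non-terminal and fully expanded.
        while len(node.path) < HORIZON and len(node.children) == len(MODELS):
            node = uct_child(node)
            trail.append(node)
        # Expansion: add one untried model choice for the next turn.
        if len(node.path) < HORIZON:
            model = rng.choice([m for m in MODELS if m not in node.children])
            node.children[model] = Node(path=node.path + (model,))
            node = node.children[model]
            trail.append(node)
        # Rollout: finish the dialogue with random choices, then score it.
        remaining = HORIZON - len(node.path)
        path = node.path + tuple(rng.choice(MODELS) for _ in range(remaining))
        reward = episode_reward(path)
        if reward > best_reward:
            best_path, best_reward = path, reward
        # Backpropagation: credit the final reward to every node on the trail.
        for visited in trail:
            visited.visits += 1
            visited.total += reward
    return best_path, best_reward


if __name__ == "__main__":
    print("myopic:", myopic_route(), episode_reward(myopic_route()))
    print("search:", *mcts_route())
```

In this toy task the myopic baseline always picks the cheapest model and never succeeds, while the optimal route pays for the expensive model on the first turn and reaches a reward of 0.7. Trajectories collected this way are the kind of search-derived data a lightweight router could then be trained on.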

📝 Abstract
Multi-turn dialogue is the predominant form of interaction with large language models (LLMs). While LLM routing is effective in single-turn settings, existing methods fail to maximize cumulative performance in multi-turn dialogue due to interaction dynamics and delayed rewards. To address this challenge, we move from myopic, single-turn selection to long-horizon sequential routing for multi-turn dialogue. Accordingly, we propose DialRouter, which first performs MCTS to explore dialogue branches induced by different LLM selections and collect trajectories with high cumulative rewards. DialRouter then learns a lightweight routing policy from search-derived data, augmented with retrieval-based future state approximation, enabling multi-turn routing without online search. Experiments on both open-domain and domain-specific dialogue tasks across diverse candidate sets of both open-source and closed-source LLMs demonstrate that DialRouter significantly outperforms single LLMs and existing routing baselines in task success rate, while achieving a superior performance-cost trade-off when combined with a cost-aware reward.
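
One way to write down the contrast between myopic and long-horizon routing is sketched here. The notation is an assumption for illustration and is not taken from the paper: $s_t$ is the dialogue state at turn $t$, $a_t \in \mathcal{M}$ the LLM selected from the candidate set, $r$ a per-turn reward, $c$ a per-call cost, and $\lambda \ge 0$ a cost weight. The exact form of the paper's cost-aware reward is not given in this summary.

```latex
% Myopic, single-turn selection: pick the best model for the current turn only.
a_t^{\text{myopic}} = \arg\max_{m \in \mathcal{M}} \; \hat{r}(s_t, m)

% Long-horizon sequential routing: pick a policy that maximizes cumulative
% reward, where each choice changes the next dialogue state.
\pi^{*} = \arg\max_{\pi} \;
  \mathbb{E}\!\left[ \sum_{t=1}^{T} r(s_t, a_t) \right],
\qquad a_t = \pi(s_t), \quad s_{t+1} \sim P(\cdot \mid s_t, a_t)

% A cost-aware variant (assumed linear penalty).
r_{\lambda}(s_t, a_t) = r(s_t, a_t) - \lambda \, c(a_t)
```

With delayed rewards, $r(s_t, a_t)$ may be zero until the final turn, which is why a per-turn argmax can miss routes whose early choices only pay off later.
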
Problem

Research questions and friction points this paper is trying to address.

multi-turn dialogue
LLM routing
long-horizon awareness
cumulative performance
delayed rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential LLM routing
long-horizon awareness
Monte Carlo Tree Search (MCTS)
multi-turn dialogue
retrieval-based future state approximation