🤖 AI Summary
This work addresses dynamic routing in multi-LLM systems, aiming to learn optimal routing policies solely from low-cost observational feedback (i.e., the outcome of only the model actually deployed for each query), thereby avoiding reliance on expensive full-feedback supervision and error-prone decoupled "predict-then-select" paradigms. We propose the first causally grounded, end-to-end routing framework, introducing two theoretically justified surrogate objectives: a classification-based upper bound and a softmax-weighted regret approximation. To capture heterogeneous cost preferences, we design an interval-conditioned neural architecture. Evaluated on public benchmarks across diverse embedding models, our method achieves state-of-the-art performance, significantly outperforming existing approaches. The results demonstrate both the feasibility and superiority of learning robust, efficient routing policies from observational data alone, without ground-truth labels or explicit error modeling.
📝 Abstract
LLM routing aims to select the most appropriate model for each query, balancing competing objectives such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy in which these metrics are first predicted and a model is then selected based on the estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
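To make the softmax-weighted regret surrogate concrete, the sketch below shows the core idea in isolation: the surrogate is the expected regret under a softmax routing policy, so driving it down pushes probability mass toward the lowest-regret model. This is a minimal illustration, not the paper's implementation; the per-model regret values, the function names, and the single-query setting are all hypothetical (in the paper, regrets are estimated from observational data rather than given).

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over model logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_weighted_regret(logits, regrets):
    # Surrogate objective: expected regret under the softmax policy.
    # Minimizing this concentrates the policy on the min-regret model.
    return float(softmax(logits) @ regrets)

# Hypothetical per-model regret estimates for one query (illustrative values).
regrets = np.array([0.30, 0.05, 0.50])

# A uniform policy pays the average regret ...
loss_uniform = softmax_weighted_regret(np.zeros(3), regrets)
# ... while a policy concentrated on the low-regret model pays much less.
loss_sharp = softmax_weighted_regret(np.array([0.0, 5.0, 0.0]), regrets)
```

As the logits sharpen toward the argmin-regret model, the surrogate approaches the minimum per-query regret, which is the intuition behind the convergence claim in the abstract.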