Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Routing queries across heterogeneous large language models (LLMs) in large-scale deployment faces challenges from diverse model capabilities, volatile service costs, and heterogeneous user preferences. Method: We propose an online adaptive model routing framework that jointly models contextual bandit decision-making and user preference vectors, enabling end-to-end trainable dynamic routing under partial real-world feedback, without requiring full offline labeling or model retraining. The method builds joint contextual representations from prompt features and user preferences and optimizes routing policies via reinforcement learning; adjusting the preference vector at inference time enables real-time accuracy–cost trade-offs. Contribution/Results: Experiments show our approach outperforms offline routing baselines by 12.46% and surpasses using the single largest LLM by 2.45%, while demonstrating strong generalization to unseen tasks.

📝 Abstract
Efficient use of large language models (LLMs) is critical for deployment at scale: without adaptive routing, systems either overpay for strong models or risk poor performance from weaker ones. Selecting the right LLM for each query is fundamentally an online decision problem: models differ in strengths, prices fluctuate, and users value accuracy and cost differently. Yet most routers are trained offline with labels for all candidate models, an assumption that breaks in deployment, where only the outcome of the chosen model is observed. We bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach that trains under the same partial-feedback restriction as deployment, while supporting preference-tunable inference: operators can dial the performance/cost trade-off at test time without retraining. Framed as a contextual bandit over prompt features and a user preference vector, our method simulates an online feedback setting during training and adapts its routing decisions to each new prompt, rather than depending on full-information offline supervision. Comprehensive experiments show that our method consistently outperforms strong offline routers by at least 12.46% and the largest LLM by at least 2.45%, and generalizes robustly to unseen tasks.
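The core setup described in the abstract can be sketched as a small contextual bandit. This is a minimal illustration, not the paper's actual method: the model list, reward shape (preference-weighted accuracy minus cost), linear per-arm reward model, and epsilon-greedy exploration are all assumptions made for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate pool: (name, accuracy, normalized cost).
# Numbers are illustrative, not taken from the paper.
MODELS = [("small", 0.70, 0.1), ("medium", 0.80, 0.4), ("large", 0.90, 1.0)]
DIM = 8  # illustrative prompt-feature dimension


class BanditRouter:
    """Linear contextual bandit over [prompt features ; preference scalar].

    Trained under partial (bandit) feedback: only the chosen model's
    outcome is ever observed, mirroring the deployment restriction.
    """

    def __init__(self, n_arms, ctx_dim, lr=0.05, eps=0.1):
        self.w = np.zeros((n_arms, ctx_dim))  # per-arm linear reward model
        self.lr, self.eps = lr, eps

    def select(self, ctx):
        if rng.random() < self.eps:              # explore
            return int(rng.integers(len(self.w)))
        return int(np.argmax(self.w @ ctx))      # exploit

    def update(self, arm, ctx, reward):
        # SGD on squared error for the chosen arm only (bandit feedback).
        pred = self.w[arm] @ ctx
        self.w[arm] += self.lr * (reward - pred) * ctx


def make_context(prompt_feats, pref):
    # pref in [0, 1]: 0 = cost-sensitive, 1 = accuracy-first.
    return np.concatenate([prompt_feats, [pref]])


router = BanditRouter(n_arms=len(MODELS), ctx_dim=DIM + 1)

for _ in range(5000):
    feats = rng.normal(size=DIM)
    pref = rng.random()
    ctx = make_context(feats, pref)
    arm = router.select(ctx)
    _, acc, cost = MODELS[arm]
    # Partial feedback: observe only the chosen model's outcome.
    success = rng.random() < acc
    reward = pref * success - (1 - pref) * cost
    router.update(arm, ctx, reward)

# Preference-tunable inference: the same trained policy, no retraining.
feats = rng.normal(size=DIM)
cheap = router.select(make_context(feats, 0.05))   # cost-sensitive routing
strong = router.select(make_context(feats, 0.95))  # accuracy-first routing
```

The final two calls illustrate the test-time knob the abstract describes: changing only the preference component of the context shifts the router's choice along the accuracy–cost trade-off, with no weight updates.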
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM selection for cost-performance trade-offs
Training routers with partial feedback instead of full supervision
Enabling tunable routing preferences without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bandit-feedback routing with preferences approach
Online training under partial-feedback deployment constraints
Adaptive routing based on prompt features and preferences