Online Multi-LLM Selection via Contextual Bandits under Unstructured Context Evolution

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the online model selection challenge in LLM services—where prompts evolve dynamically through a black-box process that cannot be simulated, modeled, or learned—this paper proposes the first contextual bandit framework tailored to unstructured prompt evolution. Methodologically, we define a myopic regret metric and design a LinUCB-based algorithm that operates without forecasting future contexts; we further incorporate budget-aware and position-aware mechanisms to enable cost-sensitive sequential model selection that prioritizes early response quality. Our approach constitutes a lightweight online learning paradigm: it requires no offline fine-tuning, no supervised training, and no reliance on historical datasets. Extensive evaluations across multiple benchmarks demonstrate substantial improvements over existing LLM routing strategies, achieving superior trade-offs between accuracy and invocation cost. These results validate the feasibility of efficient, real-time adaptive scheduling of large language models in interactive deployment scenarios.
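The LinUCB-based selection loop summarized above can be sketched as follows. This is a minimal illustration of the standard LinUCB update applied to choosing among several LLMs given a prompt embedding; the class name, feature dimension, and reward signal are placeholders, and the paper's budget-aware and position-aware extensions are not reproduced here.

```python
import numpy as np

class LinUCBSelector:
    """Minimal LinUCB sketch for choosing among several LLMs.

    Each "arm" is one LLM; the context x is a feature vector for the
    current prompt. Standard LinUCB only -- not the paper's exact
    algorithm or its budget/position-aware extensions.
    """

    def __init__(self, n_models, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # Per-model ridge-regression statistics: A = I + sum x x^T, b = sum r x
        self.A = [np.eye(dim) for _ in range(n_models)]
        self.b = [np.zeros(dim) for _ in range(n_models)]

    def select(self, x):
        """Pick the model with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # point estimate of this model's reward weights
            ucb = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(ucb)
        return int(np.argmax(scores))

    def update(self, model, x, reward):
        """Fold the observed reward for the chosen model back into its statistics."""
        self.A[model] += np.outer(x, x)
        self.b[model] += reward * x
```

In each round the caller embeds the current prompt, calls `select`, invokes the chosen LLM, scores its response (the reward), and calls `update`; no future contexts need to be predicted, which matches the myopic-regret setting.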

📝 Abstract
Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths, making it challenging to select the most suitable LLM for a given user query. We study the problem of adaptive multi-LLM selection in an online setting, where the learner interacts with users through multi-step query refinement and must choose LLMs sequentially without access to offline datasets or model internals. A key challenge arises from unstructured context evolution: the prompt dynamically changes in response to previous model outputs via a black-box process, which cannot be simulated, modeled, or learned. To address this, we propose the first contextual bandit framework for sequential LLM selection under unstructured prompt dynamics. We formalize a notion of myopic regret and develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction. We further introduce budget-aware and positionally-aware (favoring early-stage satisfaction) extensions to accommodate variable query costs and user preferences for early high-quality responses. Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training. Experiments on diverse benchmarks demonstrate that our methods outperform existing LLM routing strategies in both accuracy and cost-efficiency, validating the power of contextual bandits for real-time, adaptive LLM selection.
Problem

Research questions and friction points this paper is trying to address.

Adaptive multi-LLM selection in online settings with dynamic queries
Handling unstructured context evolution in sequential LLM selection
Balancing accuracy and cost-efficiency in real-time LLM routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual bandit framework for LLM selection
LinUCB-based algorithm with sublinear regret
Budget and positionally-aware extensions
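The budget-aware extension listed above can be illustrated by penalizing each model's UCB score with its invocation cost and excluding models the remaining budget cannot cover. This is a common cost-sensitive heuristic, not the paper's exact mechanism; the `costs` vector and the `lam` trade-off weight are illustrative assumptions.

```python
import numpy as np

def cost_aware_select(ucb_scores, costs, budget_left, lam=0.5):
    """Pick the model maximizing UCB score minus a cost penalty.

    Models whose invocation cost exceeds the remaining budget are
    excluded. `lam` trades off predicted quality against cost
    (illustrative heuristic, not the paper's exact mechanism).
    """
    scores = np.asarray(ucb_scores, dtype=float) - lam * np.asarray(costs, dtype=float)
    scores[np.asarray(costs) > budget_left] = -np.inf  # enforce the budget
    if np.all(np.isinf(scores)):
        return None  # no affordable model remains
    return int(np.argmax(scores))
```

With a tight budget the cheap model wins even when a costly model predicts slightly higher quality, which is the accuracy/cost trade-off the paper's evaluations target.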