🤖 AI Summary
In the rapidly expanding "zoo" of open-weight LLMs, non-expert users struggle to select an appropriate model, while service providers must balance inference quality against operating cost. Method: We propose MESS+, the first request-level dynamic routing framework to integrate virtual queuing with real-time prediction of user satisfaction. Under hard SLA constraints covering factual correctness, safety, and user satisfaction, the framework combines stochastic optimization with online probabilistic modeling to provide theoretical guarantees of SLA compliance and inference-cost optimality, while continually learning satisfaction distributions online. Contribution/Results: Experiments on mainstream LLM benchmarks demonstrate an average 50% (2x) reduction in inference cost, strict adherence to multi-dimensional SLA requirements, and scalable, verifiable scheduling, laying groundwork for Open Model-as-a-Service (O-MaaS).
📝 Abstract
Open-weight LLM zoos provide access to numerous high-quality models, but selecting the appropriate model for a given task remains challenging and requires technical expertise. Most users simply want factually correct, safe, and satisfying responses without concerning themselves with model technicalities, while inference service providers prioritize minimizing operating costs. These competing interests are typically mediated through service level agreements (SLAs) that guarantee minimum service quality. We introduce MESS+, a stochastic optimization algorithm for cost-optimal LLM request routing with rigorous SLA compliance guarantees. MESS+ learns the request satisfaction probabilities of individual LLMs in real time as users interact with the system, and makes each model selection decision by solving a per-request optimization problem. Our algorithm combines virtual queues with request satisfaction prediction in a novel way, and we provide a theoretical analysis of cost optimality and constraint satisfaction. Across a wide range of state-of-the-art LLM benchmarks, MESS+ achieves an average of 2x cost savings compared to existing LLM routing techniques.
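The abstract does not spell out the optimization, but the combination of virtual queues with a per-request cost/satisfaction trade-off is characteristic of Lyapunov drift-plus-penalty scheduling. The sketch below illustrates that general pattern under assumed details: the model names, costs, satisfaction probabilities, trade-off weight `V`, and satisfaction target `alpha` are all illustrative placeholders, not MESS+'s actual formulation or parameters.

```python
def route_request(models, p_sat, Q, alpha, V=10.0):
    """Pick the model minimizing a drift-plus-penalty score:
    V * inference cost, plus the SLA virtual-queue backlog Q weighted
    by the expected shortfall from the satisfaction target alpha."""
    best, best_score = None, float("inf")
    for m in models:
        score = V * m["cost"] + Q * (alpha - p_sat[m["name"]])
        if score < best_score:
            best, best_score = m, score
    return best

def update_queue(Q, alpha, satisfied):
    """Virtual queue grows when a response misses the satisfaction
    target and drains when responses exceed it; keeping Q stable
    enforces the time-averaged SLA constraint."""
    return max(0.0, Q + alpha - (1.0 if satisfied else 0.0))
```

Intuitively, when the backlog `Q` is small the router favors the cheapest model; a run of unsatisfying responses inflates `Q`, which tilts the score toward higher-satisfaction (typically more expensive) models until the constraint is back on track.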