RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses latency constraints in multi-model large language model (LLM) serving, where request routing and resource allocation are tightly coupled: a model's latency depends on both its allocated GPU resources and the request load that the routing policy sends to it, so routing methods that assume a fixed latency per model are inaccurate. The paper proposes RouterWise, a joint optimization framework that models both decisions together. It builds setup-specific latency models from system profiling and uses a dual-price formulation to compute, for each candidate setup, the score-maximizing routing policy under a latency service-level objective (SLO). Experiments show that, on the same GPU cluster, the achievable output-quality score can vary by up to 87% across retained setups, indicating that resource allocation is a key determinant of routing performance.

📝 Abstract

Multi-model LLM routing has emerged as an effective approach for reducing serving cost and latency while maintaining output quality by assigning each prompt to an appropriate model. However, prior routing methods typically assume that each model has a fixed latency. In real deployments, this assumption is inaccurate: multiple models often share limited GPU resources, and a model's latency depends strongly on both its allocated resources and the request load induced by the routing policy. Consequently, routing and resource allocation are tightly coupled. In this work, we study joint resource allocation and routing for latency-aware multi-model LLM serving in GPU clusters. Given a set of deployed models and a latency service-level objective (SLO), we seek a system setup and routing policy that maximize overall output quality while satisfying the latency target. We formalize this problem as a constrained joint optimization over deployment setup and routing fractions, and propose RouterWise, which combines a dual-price formulation for score-maximizing routing with setup-specific latency models derived from system profiling. RouterWise searches over feasible system setups and, for each fixed setup, computes the best routing policy under the latency target. Our results show that even on the same GPU cluster, achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.
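
The two-level structure described in the abstract (an outer search over setups, and an inner price-based routing step for each fixed setup) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes, for simplicity, that each setup reduces to one constant latency per model, whereas the paper's latency models also depend on request load. It also assumes the SLO applies to mean latency. All function names, setup names, and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def route_with_price(scores: Sequence[Sequence[float]],
                     latencies: Sequence[float],
                     price: float) -> List[int]:
    """Send each prompt to the model maximizing score - price * latency."""
    models = range(len(latencies))
    return [max(models, key=lambda m: row[m] - price * latencies[m])
            for row in scores]


def mean_latency(assignment: Sequence[int],
                 latencies: Sequence[float]) -> float:
    return sum(latencies[m] for m in assignment) / len(assignment)


def best_routing_for_setup(scores: Sequence[Sequence[float]],
                           latencies: Sequence[float],
                           slo: float,
                           iters: int = 50) -> Optional[List[int]]:
    """Bisect on the latency price until the mean latency meets the SLO.

    Assumption: latency is a constant per model within a setup.
    Returns None when even the fastest model violates the SLO.
    """
    if min(latencies) > slo:
        return None
    assignment = route_with_price(scores, latencies, 0.0)
    if mean_latency(assignment, latencies) <= slo:
        return assignment  # the unconstrained optimum is already feasible
    lo, hi = 0.0, 1.0
    # Grow hi until the priced routing becomes feasible.
    while mean_latency(route_with_price(scores, latencies, hi), latencies) > slo:
        hi *= 2.0
    # Invariant: hi is always a price whose routing meets the SLO.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean_latency(route_with_price(scores, latencies, mid), latencies) <= slo:
            hi = mid
        else:
            lo = mid
    return route_with_price(scores, latencies, hi)


def search_setups(scores: Sequence[Sequence[float]],
                  setups: Dict[str, Sequence[float]],
                  slo: float) -> Optional[Tuple[str, float, List[int]]]:
    """Outer loop: keep the feasible setup with the highest total score."""
    best = None
    for name, latencies in setups.items():
        assignment = best_routing_for_setup(scores, latencies, slo)
        if assignment is None:
            continue
        total = sum(scores[i][m] for i, m in enumerate(assignment))
        if best is None or total > best[1]:
            best = (name, total, assignment)
    return best


# Hypothetical data: 3 prompts, 2 models (model 0 is larger and scores higher).
scores = [[0.9, 0.5], [0.8, 0.7], [0.6, 0.55]]
# Hypothetical setups: per-model latency in seconds under each GPU allocation.
setups = {"A": [2.0, 0.5], "B": [1.0, 0.8]}
best = search_setups(scores, setups, slo=1.0)
```

In this toy instance, setup "A" makes the larger model too slow to serve every prompt within the latency target, so the price pushes two prompts to the smaller model. Setup "B" gives the larger model enough resources to serve all prompts, which yields a higher total score on the same hypothetical cluster. This is the kind of setup-dependent gap in achievable quality that the abstract reports.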

Problem

Research questions and friction points this paper is trying to address.

multi-model LLM serving

resource allocation

request routing

latency SLO

GPU clusters

Innovation

Methods, ideas, or system contributions that make the work stand out.

joint optimization

resource allocation

latency-aware routing