BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization

📅 2026-02-11
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes a co-optimization framework for cost-effective large language model (LLM) serving under latency and quality constraints by jointly optimizing heterogeneous LLM query routing and GPU deployment strategies. It introduces multi-objective Bayesian optimization (MOBO) into LLM serving for the first time, enabling end-to-end co-design of model routing, GPU resource allocation, and parallelism configuration. Experimental results demonstrate that, under identical cost and quality requirements, the proposed approach improves system throughput by 59% on average (up to 157%). Conversely, when maintaining target performance levels, it reduces service costs by 38% on average, with reductions ranging from 15% to 61%.

📝 Abstract
The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (i) an algorithmic perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (ii) a systems perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (i) determining optimal query routing strategies under latency and quality requirements, (ii) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (iii) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a quality-aware scheduling system that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. BOute employs a multi-objective Bayesian optimization (MOBO) framework to co-optimize the routing strategy and model deployment, thereby maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that BOute outperforms state-of-the-art LLM serving systems by 59% on average (up to 157%) under identical cost budgets and quality requirements, or reduces serving costs by 15%-61% (38% on average) while maintaining the same performance targets, validating its effectiveness in achieving cost-efficient LLM serving.
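To make the MOBO co-optimization idea concrete, the sketch below runs a toy multi-objective Bayesian optimization loop over a two-dimensional configuration space standing in for a routing decision (fraction of queries sent to a small model) and a deployment decision (GPU budget split). This is a generic illustration, not BOute's implementation: the `throughput` and `quality` objective functions are hypothetical stand-ins, the surrogate is a minimal RBF Gaussian-process posterior mean, and the acquisition is a greedy ParEGO-style random scalarization rather than whatever acquisition the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space: x[0] = fraction of queries routed to the small model,
# x[1] = fraction of the GPU budget given to the large model (both in [0, 1]).
# These objective functions are HYPOTHETICAL stand-ins, not BOute's models.
def throughput(x):
    # routing more to the small model and balancing GPUs raises throughput
    return 1.0 + 0.8 * x[0] - 1.5 * (x[1] - 0.5) ** 2

def quality(x):
    # routing everything to the small model degrades answer quality
    return 1.0 - 0.6 * x[0] ** 2

def rbf(A, B, ls=0.2):
    """RBF kernel matrix between row-wise point sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior_mean(X, y, Xq, noise=1e-4):
    """GP posterior mean at query points Xq given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(Xq, X) @ np.linalg.solve(K, y)

# Initial random design of 5 configurations, evaluated on both objectives.
X = rng.random((5, 2))
Y = np.array([[throughput(x), quality(x)] for x in X])

cand = rng.random((256, 2))  # fixed pool of candidate configurations
for _ in range(10):
    w = rng.dirichlet([1.0, 1.0])   # ParEGO-style random scalarization weights
    y_scal = Y @ w                  # scalarize the two objectives
    mu = gp_posterior_mean(X, y_scal, cand)
    x_next = cand[np.argmax(mu)]    # greedy pick on the surrogate mean
    X = np.vstack([X, x_next])
    Y = np.vstack([Y, [throughput(x_next), quality(x_next)]])

# Pareto filter: keep configurations not dominated in (throughput, quality).
def pareto(Y):
    keep = []
    for i, yi in enumerate(Y):
        dominated = any((yj >= yi).all() and (yj > yi).any()
                        for j, yj in enumerate(Y) if j != i)
        if not dominated:
            keep.append(i)
    return keep

front = pareto(Y)
print(f"evaluated {len(Y)} configs, Pareto front size {len(front)}")
```

Each iteration re-weights the objectives at random, so successive greedy picks trade throughput against quality differently; the Pareto filter at the end recovers the non-dominated configurations a scheduler could then choose among under a latency or quality constraint.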
Problem

Research questions and friction points this paper is trying to address.

cost-efficient LLM serving
heterogeneous LLMs
heterogeneous GPUs
query routing
model deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous LLMs
heterogeneous GPUs
multi-objective Bayesian optimization
cost-efficient serving
query routing