Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high operational costs and low resource utilization in large language model (LLM) serving on heterogeneous GPU cloud platforms—caused by suboptimal resource allocation—this paper proposes the first cost-effectiveness–driven joint optimization framework. It simultaneously optimizes heterogeneous GPU fleet selection, model deployment configurations, and dynamic request routing. Methodologically, we conduct systematic benchmarking to characterize the alignment between GPU architectures and request compute/memory requirements, then formulate a real-time availability–aware mixed-integer linear programming (MILP) scheduling model, augmented with multi-model workload profiling. Experiments under realistic workloads, fluctuating GPU supply, and multi-model co-location scenarios demonstrate that our approach reduces average service cost by 23.7% compared to homogeneous and state-of-the-art heterogeneous baselines, while significantly improving resource utilization and budget compliance rate.
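The summary above describes a mixed-integer linear program that selects a heterogeneous GPU fleet under cost and capacity constraints. As a rough illustration of that idea (not the paper's actual formulation), the sketch below picks integer counts of two hypothetical GPU types so that aggregate compute and memory capacity cover a workload at minimum hourly price; all catalogue numbers and demands are invented for the example.

```python
# Minimal MILP sketch of cost-driven GPU fleet selection.
# NOTE: hypothetical GPU types, prices, and demands -- not the paper's model.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical GPU catalogue: (hourly price, compute units, memory GB)
gpus = {
    "A": (4.0, 10.0, 80.0),   # high-compute card
    "B": (1.5, 3.0, 48.0),    # cheaper, memory-leaning card
}
price = np.array([g[0] for g in gpus.values()])
comp = np.array([g[1] for g in gpus.values()])
mem = np.array([g[2] for g in gpus.values()])

demand_compute, demand_memory = 25.0, 200.0  # workload requirements

# Minimize price @ x  s.t.  comp @ x >= demand_compute,  mem @ x >= demand_memory,
# with x a nonnegative integer vector of GPU counts.
res = milp(
    c=price,
    constraints=[
        LinearConstraint(comp, lb=demand_compute),
        LinearConstraint(mem, lb=demand_memory),
    ],
    integrality=np.ones(len(gpus)),  # require integer GPU counts
    bounds=Bounds(lb=0),
)
counts = dict(zip(gpus, res.x.astype(int)))
print(counts, res.fun)
```

The paper's full model additionally encodes deployment configurations, request routing, budget limits, and real-time GPU availability; this sketch only shows the fleet-selection core of such a formulation.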

📝 Abstract
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving, as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study of serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. In particular, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming to deduce the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availabilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Resource Allocation
GPU Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Heterogeneous GPU Resources
Cost-Efficient Processing
Large Language Models Optimization
Youhe Jiang
Department of Computer Science, University of Cambridge, Cambridgeshire, UK
Fangcheng Fu
Shanghai Jiao Tong University
machine learning, deep learning, MLSys, distributed computation
Xiaozhe Yao
ETH Zurich
Machine Learning Systems, Machine Learning, LLMs
Guoliang He
Department of Computer Science, University of Cambridge, Cambridgeshire, UK
Xupeng Miao
Purdue University
Machine Learning Systems, Data Management
Ana Klimovic
ETH Zurich
Computer systems, Cloud Computing, Computer Architecture
Bin Cui
Department of Computer Science, Peking University, Beijing, China
Binhang Yuan
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Eiko Yoneki
Computer Laboratory, University of Cambridge
optimisation, large-scale graph processing, distributed systems