🤖 AI Summary
To address the challenges of LLM inference serving in heterogeneous cloud environments, including diverse GPU architectures, fluctuating network bandwidth, and low resource utilization, this paper proposes ThunderServe, a cost- and performance-aware inference serving system. Its key contributions are: (1) a heterogeneity-aware fine-grained scheduling algorithm that jointly models GPU compute capacity, VRAM size, and network bandwidth to optimize the deployment plan; (2) a lightweight online re-scheduling mechanism that adapts to fluctuating conditions such as node failures and workload shifts without costly restarts of ongoing services; and (3) an end-to-end inference optimization stack integrating PagedAttention, continuous batching, and dynamic bandwidth adaptation. Extensive experiments show that, under the same price budget, ThunderServe delivers a 1.7× average throughput increase (up to 2.1×) and a 1.5× average reduction in latency deadlines (up to 2.5×), significantly outperforming state-of-the-art systems.
📝 Abstract
Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and for cost-effectiveness. However, the diversity of network environments and GPU types on the cloud makes it difficult to achieve high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a 2.1× and on average a 1.7× increase in throughput, and achieves up to a 2.5× and on average a 1.5× reduction in latency deadlines, compared with state-of-the-art systems given the same price budget, suggesting that opting for cloud services provides a more cost-efficient solution.
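To make the cost/performance trade-off concrete, here is a minimal toy sketch of the kind of deployment-plan search the abstract describes: pick a set of heterogeneous GPUs that maximizes throughput under a fixed hourly price budget and an aggregate VRAM floor. The GPU catalog, its numbers, and the brute-force search are all illustrative assumptions for this sketch, not ThunderServe's actual algorithm, which jointly models network bandwidth and uses a far more sophisticated scheduler.

```python
from itertools import combinations_with_replacement

# Hypothetical GPU catalog: name -> (relative throughput, VRAM in GB, $/hour).
# These numbers are illustrative only and do not come from the paper.
GPU_TYPES = {
    "A10":  (1.0, 24, 1.0),
    "L4":   (0.8, 24, 0.7),
    "A100": (3.0, 80, 3.5),
}

def best_plan(budget, min_vram, max_gpus=4):
    """Brute-force a deployment plan: choose a multiset of GPUs that fits
    the hourly budget and meets the aggregate VRAM floor, maximizing total
    throughput. A toy stand-in for a heterogeneity-aware scheduler."""
    best = (0.0, ())
    names = list(GPU_TYPES)
    for k in range(1, max_gpus + 1):
        for combo in combinations_with_replacement(names, k):
            tput = sum(GPU_TYPES[g][0] for g in combo)
            vram = sum(GPU_TYPES[g][1] for g in combo)
            cost = sum(GPU_TYPES[g][2] for g in combo)
            if cost <= budget and vram >= min_vram and tput > best[0]:
                best = (tput, combo)
    return best
```

For example, with a $3/hour budget and a 48 GB VRAM floor, this sketch prefers four cheap L4s over a single A100 that exceeds the budget, illustrating why a budget-aware plan over heterogeneous cloud GPUs can beat simply buying the fastest card.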