λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the slow scaling and poor burst-load responsiveness of serverless LLM inference caused by high model loading overhead, this paper proposes λScale, a serverless inference system that combines RDMA-accelerated model multicast with an execute-while-load mechanism. Its core contributions are threefold: (1) λPipe, an adaptive scaling scheme that multicasts model data and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference; (2) efficient model management across GPU and host memory, enabling fast scaling from different storage tiers; and (3) deep optimizations of the serverless runtime. Evaluated on real-world LLM inference traces, λScale reduces tail latency by up to 5× and lowers service cost by 31.3% compared to state-of-the-art solutions, significantly improving both responsiveness under bursty workloads and resource efficiency.

📝 Abstract
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5x tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
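The core "execute-while-load" idea from the abstract can be illustrated with a toy sketch: instead of waiting for the entire model to arrive before serving requests, a node executes each layer as soon as it lands in memory, overlapping compute with the remaining transfer. This is only a minimal single-process analogy (the paper's system streams layers over RDMA to GPUs); the queue, sentinel, and "forward pass" below are placeholders, not λScale's actual API.

```python
import queue
import threading
import time

NUM_LAYERS = 4

def loader(loaded: queue.Queue) -> None:
    # Simulate streaming model layers from a remote node. The paper uses
    # RDMA multicast between GPU nodes; this sleep-loop is a stand-in.
    for layer in range(NUM_LAYERS):
        time.sleep(0.01)   # per-layer transfer latency (placeholder)
        loaded.put(layer)  # layer is now "resident" and ready to execute
    loaded.put(None)       # sentinel: the full model has arrived

def execute_while_load(loaded: queue.Queue, x: int):
    # Execute each layer as soon as it arrives, overlapping compute with
    # the remaining transfer instead of blocking on a full model load.
    executed = []
    while (layer := loaded.get()) is not None:
        x = x + 1          # stand-in for the layer's forward pass
        executed.append(layer)
    return x, executed

q = queue.Queue()
t = threading.Thread(target=loader, args=(q,))
t.start()
out, executed = execute_while_load(q, 0)  # inference starts before loading ends
t.join()
```

The point of the overlap is that a request submitted at scale-out time begins making progress after the first layer arrives, rather than after the last one.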
Problem

Research questions and friction points this paper is trying to address.

Addresses serverless platform scaling inefficiencies
Enhances large language model inference speed
Reduces model startup overhead and costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages RDMA for fast model multicast
Introduces execute-while-load technique
Supports adaptive model multicast via λPipe
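To see why multicast with per-chunk forwarding helps scaling, consider an illustrative cost model (an assumption for intuition, not λPipe's exact scheme): in a pipelined chain, each receiving node forwards a model chunk to its successor as soon as the chunk arrives, so total transfer time grows additively in chunks and nodes rather than multiplicatively as with naive sequential unicast from one source.

```python
def pipelined_chain_steps(num_chunks: int, num_nodes: int) -> int:
    # With per-chunk forwarding along a chain, the last node finishes
    # after (num_chunks + num_nodes - 1) unit transfer steps.
    return num_chunks + num_nodes - 1

def sequential_unicast_steps(num_chunks: int, num_nodes: int) -> int:
    # The single source sends the whole model to each node in turn.
    return num_chunks * num_nodes

# e.g. 8 chunks to 4 nodes: 11 steps pipelined vs. 32 steps sequential.
pipelined = pipelined_chain_steps(8, 4)
sequential = sequential_unicast_steps(8, 4)
```

The gap widens with more nodes, which is why multicast-based distribution is central to absorbing bursty scale-out.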
Minchen Yu
The Chinese University of Hong Kong, Shenzhen
cloud computing, serverless computing, big data systems, machine learning systems
Rui Yang
University of Virginia
Chaobo Jia
The Chinese University of Hong Kong, Shenzhen
Zhaoyuan Su
University of Virginia
Sheng Yao
Hong Kong University of Science and Technology
Tingfeng Lan
Department of Computer Science, University of Virginia
ML systems
Yuchen Yang
Hong Kong University of Science and Technology
Yue Cheng
University of Virginia
Wei Wang
Hong Kong University of Science and Technology
Ao Wang
Alibaba Group
Ruichuan Chen
Distinguished Member of Technical Staff @ Bell Labs
Cloud computing, Machine learning systems, Decentralized systems, Privacy