🤖 AI Summary
To address the scalability lag and poor burst-load responsiveness of serverless LLM inference, caused by high model-loading overhead, this paper proposes λScale, a serverless inference system that combines RDMA-accelerated model multicast with an "execute-while-load" mechanism. Its core contributions are threefold: (1) λPipe, an adaptive model-scaling scheme that multicasts model weights across GPU nodes and dynamically constructs execution pipelines over the receiving nodes for collaborative, distributed inference; (2) memory-tier-aware model management spanning GPU and host memory, enabling fast scaling for models in different storage tiers; and (3) deep optimizations of the serverless runtime. Evaluated on real-world LLM inference traces, λScale reduces tail latency by up to 5× and lowers service cost by 31.3% compared to state-of-the-art baselines, significantly improving both responsiveness under bursty workloads and resource efficiency.
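The fast-scaling idea behind the multicast contribution can be illustrated with a simple tree-based schedule: in each round, every node that already holds the model forwards it to one node that does not, so coverage doubles each round and N nodes are reached in about log2(N) rounds. This is a generic sketch of tree multicast, not λScale's actual scheme; the function name and schedule format are illustrative.

```python
def multicast_rounds(num_nodes):
    """Return a per-round send schedule [[(src, dst), ...], ...].

    Node 0 starts with the model; each round, every holder forwards
    the model to one node that lacks it, doubling coverage per round.
    (Illustrative sketch only, not lambda-Scale's real multicast.)
    """
    have = [0]            # nodes that currently hold the model
    rounds = []
    while len(have) < num_nodes:
        sends = []
        next_free = len(have)
        for src in list(have):     # snapshot: only existing holders send
            if next_free >= num_nodes:
                break
            sends.append((src, next_free))
            have.append(next_free)
            next_free += 1
        rounds.append(sends)
    return rounds
```

For example, `multicast_rounds(8)` finishes in three rounds (1→2→4→8 holders), versus seven rounds if a single source sent the model to each node in turn.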
📝 Abstract
Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission, referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs execution pipelines across receiving nodes for collaborative, distributed inference. Additionally, λScale supports efficient model management across GPU and host memory, allowing fast scaling for models across different storage tiers. Evaluation results show that λScale enables fast model scaling and effectively handles load spikes, achieving up to 5× tail-latency improvement and 31.3% cost reduction compared to state-of-the-art solutions on real-world LLM inference traces.
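The "execute-while-load" idea, overlapping inference on already-received layers with the transfer of the remaining ones, can be sketched with a producer/consumer pair: a loader thread hands off layers as they arrive, and an inference thread applies each layer as soon as it is available instead of waiting for the full model. All names here (`ExecuteWhileLoadPipeline`, the toy layer functions) are hypothetical illustrations, not λScale's actual implementation; real layers would be GPU kernels fed by RDMA transfers.

```python
import threading
import queue

class ExecuteWhileLoadPipeline:
    """Toy sketch of execute-while-load: compute overlaps with loading."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.loaded = queue.Queue()  # layers handed off by the loader

    def loader(self, fetch_layer):
        # Simulates the receiving side of the multicast: layers arrive
        # one by one and are immediately made available for execution.
        for i in range(self.num_layers):
            self.loaded.put(fetch_layer(i))

    def infer(self, x):
        # Consumes layers in order, computing as soon as each arrives,
        # so early-layer execution overlaps with later-layer transfer.
        for _ in range(self.num_layers):
            layer = self.loaded.get()  # blocks until the layer is loaded
            x = layer(x)
        return x

def run(num_layers, x0):
    pipe = ExecuteWhileLoadPipeline(num_layers)
    # Each "layer" is a toy function x -> x + i standing in for a real
    # transformer layer, so the result is easy to check.
    t = threading.Thread(target=pipe.loader,
                         args=(lambda i: (lambda x, i=i: x + i),))
    t.start()
    out = pipe.infer(x0)
    t.join()
    return out
```

With four toy layers adding 0, 1, 2, 3 in turn, `run(4, 0)` returns 6; the point of the pattern is that `infer` never waits for the whole model, only for the next layer it needs.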