FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

📅 2023-06-06
🏛️ arXiv.org
📈 Citations: 14
Influential: 2
🤖 AI Summary
Current serverless platforms provide weak GPU support, resulting in low resource utilization and inference latency that fails to meet service-level objectives (SLOs). This paper addresses GPU inefficiency in serverless inference by proposing an SLO-aware dynamic model swapping mechanism that enables fine-grained SLO guarantees across multiple co-located functions sharing a single GPU. The authors introduce a late-binding model swapping architecture and an interference-aware scheduling algorithm, enabling millisecond-level SLO compliance for hundreds of concurrent functions on shared-GPU serverless infrastructure. Key techniques include asynchronous API redirection, GPU runtime sharing, pipelined execution, efficient memory management, and dynamic scheduling. Evaluation shows that a single node with four V100 GPUs stably supports over 100 inference functions while matching dedicated-GPU performance; in a six-node production deployment, the system continuously serves 1,000+ functions while satisfying their respective SLOs.
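The late-binding idea in the summary (models stay in host memory and are swapped onto a GPU only when a request arrives) implies an eviction policy once GPU memory fills up. A minimal sketch, assuming a simple LRU policy and treating GPU memory as a flat megabyte budget; the class name `GpuMemoryPool`, the model names, and all sizes are illustrative, not taken from the paper:

```python
from collections import OrderedDict

class GpuMemoryPool:
    """Simulates one GPU's memory for late-binding model swapping:
    models are swapped in on demand, and least-recently-used models
    are evicted when a newly arriving model does not fit."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()  # model name -> size in MB, oldest first

    def swap_in(self, name, size_mb):
        """Ensure `name` is GPU-resident; return the list of evicted models."""
        if name in self.resident:
            self.resident.move_to_end(name)  # already resident: refresh LRU order
            return []
        evicted = []
        while sum(self.resident.values()) + size_mb > self.capacity_mb:
            victim, _ = self.resident.popitem(last=False)  # evict the LRU model
            evicted.append(victim)
        self.resident[name] = size_mb
        return evicted

pool = GpuMemoryPool(capacity_mb=16000)
pool.swap_in("resnet50", 200)
pool.swap_in("bert-base", 500)
evicted = pool.swap_in("gpt2-xl", 15500)  # large model forces an eviction
```

Here the final swap-in evicts only `resnet50` (the least recently used model), after which `bert-base` and `gpt2-xl` exactly fill the budget.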
📝 Abstract
Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
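The "pipelined model execution" mentioned in the abstract hides swap-in cost by copying one layer from host to GPU while the previously copied layer executes. A toy latency model of that overlap, assuming host-to-GPU copies are serialized on one link and a layer can start executing as soon as it has been copied and the GPU is free; all timings below are invented for illustration, not measurements from the paper:

```python
def pipelined_latency(copy_ms, exec_ms):
    """End-to-end latency when each layer's copy overlaps the
    previous layer's execution (layer-wise swap/compute pipeline)."""
    t_copy_done = 0.0  # when the current layer finishes copying
    t_exec_done = 0.0  # when the current layer finishes executing
    for c, e in zip(copy_ms, exec_ms):
        t_copy_done += c                       # copies are serialized on the link
        start = max(t_copy_done, t_exec_done)  # wait for both copy and GPU
        t_exec_done = start + e
    return t_exec_done

def sequential_latency(copy_ms, exec_ms):
    """Baseline: copy the whole model first, then execute it."""
    return sum(copy_ms) + sum(exec_ms)

copy_times = [4.0, 4.0, 4.0, 4.0]   # per-layer host-to-GPU copy time (ms)
exec_times = [5.0, 5.0, 5.0, 5.0]   # per-layer execution time (ms)
```

With these numbers the pipeline finishes in 24 ms versus 36 ms sequentially: all but the first copy is hidden behind compute.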
Problem

Research questions and friction points this paper is trying to address.

Enable efficient GPU sharing in serverless platforms for inference
Minimize latency overhead caused by dynamic model swapping
Reduce GPU provisioning costs for users and platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic GPU model swapping for efficient sharing
Asynchronous API redirection to minimize latency
Interference-aware scheduling with GPU interconnects
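The interference-aware scheduling idea above can be sketched as: predict each GPU's latency under its current load, place the request on the GPU with the lowest prediction, and reject the placement if no GPU can meet the SLO. The linear slowdown model below is a stand-in assumption for illustration; the paper's actual interference predictor and algorithm are not reproduced here:

```python
def predict_latency(base_ms, active_jobs, slowdown_per_job=0.15):
    """Assumed interference model: each co-located job inflates
    latency by a fixed fraction of the solo-run latency."""
    return base_ms * (1.0 + slowdown_per_job * active_jobs)

def schedule(request_base_ms, slo_ms, gpu_active_jobs):
    """Return the index of the GPU with the lowest predicted latency,
    or None if no placement can satisfy the request's SLO."""
    best_gpu, best_lat = None, float("inf")
    for gpu, jobs in enumerate(gpu_active_jobs):
        lat = predict_latency(request_base_ms, jobs)
        if lat < best_lat:
            best_gpu, best_lat = gpu, lat
    if best_lat > slo_ms:
        return None  # admitting the request would violate its SLO
    return best_gpu

# 20 ms solo latency, 30 ms SLO, four GPUs with varying load
choice = schedule(request_base_ms=20.0, slo_ms=30.0, gpu_active_jobs=[3, 1, 2, 4])
```

Here GPU 1 (one active job, predicted 23 ms) wins; tightening the SLO below every prediction makes the scheduler reject the request instead.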
👥 Authors
Minchen Yu (The Chinese University of Hong Kong, Shenzhen)
Ao Wang (Alibaba Group)
Dong-dong Chen (Hong Kong University of Science and Technology)
Haoxuan Yu (Hong Kong University of Science and Technology)
Xiaonan Luo (Hong Kong University of Science and Technology)
Zhuohao Li (Hong Kong University of Science and Technology)
W. Wang (Hong Kong University of Science and Technology)
Ruichuan Chen (Bell Labs)
Dapeng Nie (Alibaba Group)
Haoran Yang (Central South University)