FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping

📅 2023-06-06
🏛️ arXiv.org
📈 Citations: 14
Influential: 2
🤖 AI Summary
Current serverless platforms provide weak GPU support, resulting in low resource utilization and inference latency that fails to meet service-level objectives (SLOs). This paper addresses GPU inefficiency in serverless inference by proposing an SLO-aware dynamic model swapping mechanism that enables fine-grained SLO guarantees across multiple co-located functions sharing a single GPU. The authors introduce a late-binding model swapping architecture and an interference-aware scheduling algorithm, enabling millisecond-level SLO compliance for hundreds of concurrent functions on shared-GPU serverless infrastructure. Key techniques include asynchronous API redirection, GPU runtime sharing, pipelined execution, efficient memory management, and dynamic scheduling. Evaluation shows that a single node with four V100 GPUs stably supports over 100 inference functions while matching dedicated-GPU performance; in a six-node production deployment, the system continuously serves 1,000+ functions while satisfying their respective SLOs.
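The late-binding idea in the summary (models stay in host memory and are swapped onto a GPU only when a request arrives) implies an eviction policy once GPU memory fills up. A minimal sketch, assuming a simple LRU policy and treating GPU memory as a flat megabyte budget; the class name `GpuMemoryPool`, the model names, and all sizes are illustrative, not taken from the paper:

```python
from collections import OrderedDict

class GpuMemoryPool:
    """Simulates one GPU's memory for late-binding model swapping:
    models are swapped in on demand, and least-recently-used models
    are evicted when a newly arriving model does not fit."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()  # model name -> size in MB, oldest first

    def swap_in(self, name, size_mb):
        """Ensure `name` is GPU-resident; return the list of evicted models."""
        if name in self.resident:
            self.resident.move_to_end(name)  # already resident: refresh LRU order
            return []
        evicted = []
        while sum(self.resident.values()) + size_mb > self.capacity_mb:
            victim, _ = self.resident.popitem(last=False)  # evict the LRU model
            evicted.append(victim)
        self.resident[name] = size_mb
        return evicted

pool = GpuMemoryPool(capacity_mb=16000)
pool.swap_in("resnet50", 200)
pool.swap_in("bert-base", 500)
evicted = pool.swap_in("gpt2-xl", 15500)  # large model forces an eviction
```

Here the final swap-in evicts only `resnet50` (the least recently used model), after which `bert-base` and `gpt2-xl` exactly fill the budget.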
📝 Abstract
Serverless computing has become increasingly popular for machine learning inference. However, current serverless platforms lack efficient support for GPUs, limiting their ability to deliver low-latency inference. In this paper, we propose FaaSwap, a GPU-efficient serverless inference platform. FaaSwap employs a holistic approach to system and algorithm design. It maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding), thereby enabling a large number of inference functions to efficiently share a node's GPUs. FaaSwap uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to achieve optimal performance. We also develop an interference-aware request scheduling algorithm that allows FaaSwap to meet the latency SLOs for individual inference functions. We have implemented FaaSwap as a prototype on a leading commercial serverless platform. Experimental evaluations demonstrate that, with model swapping, FaaSwap can concurrently serve hundreds of functions on a single worker node with 4 V100 GPUs, while achieving inference performance comparable to native execution (where each function runs on a dedicated GPU). When deployed on a 6-node production testbed, FaaSwap meets the latency SLOs for over 1k functions, the maximum that the testbed can handle concurrently.
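The "pipelined model execution" mentioned in the abstract hides swap-in cost by copying one layer from host to GPU while the previously copied layer executes. A toy latency model of that overlap, assuming host-to-GPU copies are serialized on one link and a layer can start executing as soon as it has been copied and the GPU is free; all timings below are invented for illustration, not measurements from the paper:

```python
def pipelined_latency(copy_ms, exec_ms):
    """End-to-end latency when each layer's copy overlaps the
    previous layer's execution (layer-wise swap/compute pipeline)."""
    t_copy_done = 0.0  # when the current layer finishes copying
    t_exec_done = 0.0  # when the current layer finishes executing
    for c, e in zip(copy_ms, exec_ms):
        t_copy_done += c                       # copies are serialized on the link
        start = max(t_copy_done, t_exec_done)  # wait for both copy and GPU
        t_exec_done = start + e
    return t_exec_done

def sequential_latency(copy_ms, exec_ms):
    """Baseline: copy the whole model first, then execute it."""
    return sum(copy_ms) + sum(exec_ms)

copy_times = [4.0, 4.0, 4.0, 4.0]   # per-layer host-to-GPU copy time (ms)
exec_times = [5.0, 5.0, 5.0, 5.0]   # per-layer execution time (ms)
```

With these numbers the pipeline finishes in 24 ms versus 36 ms sequentially: all but the first copy is hidden behind compute.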
Problem

Research questions and friction points this paper is trying to address.

Enable efficient GPU sharing in serverless platforms for inference
Minimize latency overhead caused by dynamic model swapping
Reduce GPU provisioning costs for users and platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic GPU model swapping for efficient sharing
Asynchronous API redirection to minimize latency
Interference-aware scheduling with GPU interconnects
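The interference-aware scheduling idea above can be sketched as: predict each GPU's latency under its current load, place the request on the GPU with the lowest prediction, and reject the placement if no GPU can meet the SLO. The linear slowdown model below is a stand-in assumption for illustration; the paper's actual interference predictor and algorithm are not reproduced here:

```python
def predict_latency(base_ms, active_jobs, slowdown_per_job=0.15):
    """Assumed interference model: each co-located job inflates
    latency by a fixed fraction of the solo-run latency."""
    return base_ms * (1.0 + slowdown_per_job * active_jobs)

def schedule(request_base_ms, slo_ms, gpu_active_jobs):
    """Return the index of the GPU with the lowest predicted latency,
    or None if no placement can satisfy the request's SLO."""
    best_gpu, best_lat = None, float("inf")
    for gpu, jobs in enumerate(gpu_active_jobs):
        lat = predict_latency(request_base_ms, jobs)
        if lat < best_lat:
            best_gpu, best_lat = gpu, lat
    if best_lat > slo_ms:
        return None  # admitting the request would violate its SLO
    return best_gpu

# 20 ms solo latency, 30 ms SLO, four GPUs with varying load
choice = schedule(request_base_ms=20.0, slo_ms=30.0, gpu_active_jobs=[3, 1, 2, 4])
```

Here GPU 1 (one active job, predicted 23 ms) wins; tightening the SLO below every prediction makes the scheduler reject the request instead.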
👥 Authors
Minchen Yu (The Chinese University of Hong Kong, Shenzhen)
Ao Wang (Alibaba Group)
Dong-dong Chen (Hong Kong University of Science and Technology)
Haoxuan Yu (Hong Kong University of Science and Technology)
Xiaonan Luo (Hong Kong University of Science and Technology)
Zhuohao Li (Hong Kong University of Science and Technology)
W. Wang (Hong Kong University of Science and Technology)
Ruichuan Chen (Bell Labs)
Dapeng Nie (Alibaba Group)
Haoran Yang (Central South University)