AI Summary
NVIDIA Multi-Process Service (MPS) lacks fault tolerance, as a failure in a single process triggers the collective termination of all co-located processes, severely limiting its applicability in critical scenarios such as multi-tenant GPU clusters. This work presents the first systematic characterization of the end-to-end GPU fault handling pipeline and introduces two complementary mechanisms: a software-level isolation scheme for memory-related faults and a fast recovery mechanism leveraging virtual memory-based GPU state sharing. By modifying open-source GPU driver kernel modules and integrating GPU-resident state sharing, the proposed approach achieves efficient fault tolerance across diverse GPUs and workloads, significantly enhancing MPS resilience with minimal performance overhead.
Abstract
NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, limiting its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design fault-resilient MPS to solve this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. For other faults, whose processing occurs within proprietary software, we design a practical alternative: fast recovery using virtual-memory-based GPU-resident state sharing. Our evaluation on different GPUs and workloads shows that these mechanisms handle the corresponding faults effectively with minimal overhead.