Characterization-Guided GPU Fault Resilience in NVIDIA MPS

📅 2026-05-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
NVIDIA Multi-Process Service (MPS) lacks fault tolerance: a failure in a single process terminates all co-located processes, which limits its use in resilience-critical settings such as multi-tenant GPU clusters. This work presents a systematic characterization of the end-to-end GPU fault handling pipeline and introduces two complementary mechanisms: software-level isolation for memory-related faults, implemented in the open GPU driver kernel module, and fast recovery based on virtual-memory-based sharing of GPU-resident state for faults that are handled inside proprietary software. Evaluated on different GPUs and workloads, the two mechanisms handle their respective fault classes effectively with minimal overhead.
📝 Abstract
NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, limiting its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design a fault-resilient MPS to solve this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. For other faults, whose processing occurs within proprietary software, we design a practical alternative: fast recovery using virtual-memory-based GPU-resident state sharing. Our evaluation on different GPUs and workloads shows that these mechanisms handle their corresponding faults effectively with minimal overhead.
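The idea behind the fast recovery mechanism can be illustrated with a host-side analogy. The sketch below is not the authors' implementation and does not touch the GPU: it assumes that state kept in a memory segment owned by a separate holder survives a worker fault, so a replacement worker can re-attach to it instead of rebuilding it. It uses Python's `multiprocessing.shared_memory` as a stand-in for GPU-resident memory, and every function name (`publish_state`, `attach_and_read`, `faulty_worker`, `run_with_recovery`) is hypothetical.

```python
import struct
from multiprocessing import shared_memory


def publish_state(values):
    """Holder side: place the state in a shared segment that outlives workers."""
    payload = struct.pack(f"{len(values)}d", *values)
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    return segment  # the holder keeps this open; segment.name identifies it


def attach_and_read(name, count):
    """Worker side: map the existing segment by name and read `count` doubles."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        nbytes = count * struct.calcsize("d")
        return list(struct.unpack(f"{count}d", bytes(segment.buf[:nbytes])))
    finally:
        segment.close()  # drops this worker's mapping only, not the segment


def faulty_worker(name, count):
    """Hypothetical worker that attaches to the state and then faults."""
    attach_and_read(name, count)
    raise RuntimeError("simulated fault")


def run_with_recovery(name, count):
    """Run the worker; on a fault, a replacement re-attaches to the same state."""
    try:
        faulty_worker(name, count)
    except RuntimeError:
        # Recovery path: no reload from disk, just a fresh mapping.
        return attach_and_read(name, count)


if __name__ == "__main__":
    holder = publish_state([1.0, 2.5, -3.0])
    try:
        print(run_with_recovery(holder.name, 3))
    finally:
        holder.close()
        holder.unlink()  # the holder alone decides when the state is freed
```

The design point the analogy is meant to convey is ownership: because the holder, not the worker, owns the segment, a worker fault costs only a re-attach. How the paper realizes this on the GPU, beyond its use of virtual memory, is not described in the abstract.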
Problem

Research questions and friction points this paper is trying to address.

GPU fault resilience
NVIDIA MPS
fault isolation
multi-tenant GPU clusters
process termination
Innovation

Methods, ideas, or system contributions that make the work stand out.

fault resilience
GPU sharing
NVIDIA MPS
fault isolation
fast recovery