🤖 AI Summary
Addressing the high-reliability requirements of AI/HPC systems, this work investigates GPU hardware resilience using two-and-a-half years of production error logs from the Delta supercomputer, covering NVIDIA A40, A100, and H100 GPUs.
Method: We integrate hardware-level fault root-cause analysis, error propagation path modeling, and node availability simulation.
Contributions/Results: (1) GPU memory is found to be over 30× more reliable than GPU hardware overall in terms of MTBE (mean time between errors); (2) the GPU System Processor (GSP) is identified as the most vulnerable single-point component, while NVLink errors, thanks to built-in error detection and retry mechanisms, rarely cause job failures; (3) emulation at larger scales shows that 5–20% node overprovisioning would be needed to mask GPU failures, and improving GPU availability to 99.9% would cut that overprovisioning by 4×; (4) we establish an MTBE ranking across GPU components and document failure cases tracing critical hardware faults to application crashes, directly informing resilient AI accelerator design.
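To make the MTBE ranking concrete, here is a minimal sketch of how mean time between errors can be derived per component from an error log. The record format, field names, and component labels are illustrative assumptions, not the study's actual log schema.

```python
from collections import defaultdict

# Hypothetical error-log records as (timestamp, component) pairs; the
# schema and component labels are illustrative, not the study's real data.
records = [
    ("2023-01-02T04:15:00", "GSP"),
    ("2023-01-05T11:30:00", "NVLink"),
    ("2023-02-10T22:05:00", "GSP"),
    ("2023-03-01T09:00:00", "GPU memory"),
]

OBSERVATION_HOURS = 2.5 * 365 * 24  # ~2.5 years of production logs

def mtbe_ranking(records, window_hours=OBSERVATION_HOURS):
    """Rank components by MTBE = observation window / error count.
    A lower MTBE means the component is more error-prone."""
    counts = defaultdict(int)
    for _timestamp, component in records:
        counts[component] += 1
    return sorted(
        ((comp, window_hours / n) for comp, n in counts.items()),
        key=lambda item: item[1],
    )

for component, hours in mtbe_ranking(records):
    print(f"{component}: MTBE ~ {hours:,.0f} h")
```

In this toy data, GSP accrues the most errors and therefore ranks first (lowest MTBE), mirroring the paper's finding that GSP is the most vulnerable component.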
📝 Abstract
In this study, we characterize GPU failures in Delta, a current large-scale AI system with over 600 petaflops of peak compute throughput. The system comprises GPU and non-GPU nodes with modern AI accelerators, such as NVIDIA A40, A100, and H100 GPUs. The study uses two and a half years of data on GPU errors. We evaluate the resilience of GPU hardware components to determine the vulnerability of different GPU components to failure and their impact on GPU and node availability. We measure the key error propagation paths in GPU hardware, GPU interconnect (NVLink), and GPU memory. Finally, we evaluate the impact of the observed GPU errors on user jobs. Our key findings are: (i) Contrary to common belief, GPU memory is over 30x more reliable than GPU hardware in terms of MTBE (mean time between errors). (ii) The newly introduced GSP (GPU System Processor) is the most vulnerable GPU hardware component. (iii) NVLink errors did not always lead to user job failure, which we attribute to the underlying error detection and retry mechanisms. (iv) We show multiple examples of hardware errors originating in key GPU hardware components and leading to application failure. (v) We project the impact of GPU node availability at larger scales with emulation and find that significant overprovisioning, between 5% and 20%, would be necessary to handle GPU failures. If GPU availability were improved to 99.9%, the overprovisioning could be reduced by 4x.
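As a rough illustration of the overprovisioning projection, the following Monte Carlo sketch estimates the spare-node fraction needed to keep a fixed number of nodes simultaneously available with 99.9% probability. It assumes independent node failures with a uniform per-node availability, which is a simplification for illustration, not the paper's emulation methodology; the node counts and availability values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def spare_fraction(needed, node_availability, target=0.999, trials=100_000):
    """Smallest spare-node fraction such that at least `needed` of the
    provisioned nodes are up simultaneously with probability >= `target`.
    Assumes independent, identically available nodes (a simplification)."""
    spares = 0
    while True:
        total = needed + spares
        # Sample how many of the provisioned nodes are up in each trial.
        up = rng.binomial(total, node_availability, size=trials)
        if (up >= needed).mean() >= target:
            return spares / needed
        spares += 1

for availability in (0.95, 0.99, 0.999):
    frac = spare_fraction(1024, availability)
    print(f"per-node availability {availability:.3f}: ~{frac:.1%} overprovisioning")
```

Under these simplified assumptions, the required spare fraction drops sharply as per-node availability improves, consistent in spirit with the roughly 4x reduction at 99.9% availability reported above; the paper's 5-20% range reflects the richer failure dynamics captured by its emulation.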