🤖 AI Summary
This study addresses GPU memory reliability on Ampere-architecture GPUs, particularly the NVIDIA A100, in HPC systems through a large-scale empirical analysis across multiple supercomputing platforms. Leveraging 67.77 million GPU-device-hours of operational logs from 10,693 GPUs, it presents the first cross-system comparative analysis of GPU memory errors. The methodology combines statistical analysis, mean-time-between-errors (MTBE) modeling, and multi-source log correlation to quantify error rates and MTBE, while identifying critical error patterns, including temperature sensitivity and spatial clustering. The work uncovers both commonalities and system-specific differences in error behavior at the system level. These findings provide a large-scale empirical foundation for optimizing checkpointing strategies, designing fault-tolerance mechanisms, and assessing HPC system reliability, delivering actionable, engineering-oriented guidance grounded in real-world data.
📝 Abstract
Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward building efficient and reliable HPC systems. In this work, we present a large-scale cross-supercomputer study characterizing GPU memory reliability on three supercomputers, Delta, Polaris, and Perlmutter, all equipped with NVIDIA A100 GPUs. We examine error logs spanning 67.77 million GPU-device-hours across 10,693 GPUs. We compare error rates and mean time between errors (MTBE), highlighting both shared and distinct error characteristics among the three systems. Based on these observations and analyses, we discuss the implications and lessons learned for the reliable operation of supercomputers, the choice of checkpointing interval, and how reliability characteristics compare with those of previous-generation GPUs. Our characterization study provides valuable insights for fault-tolerant HPC system design and operation, enabling more efficient execution of HPC applications.
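The abstract's point about choosing a checkpointing interval from measured MTBE can be illustrated with Young's classic first-order approximation, T_opt ≈ sqrt(2 · C · MTBE), where C is the cost of writing one checkpoint. This is a standard textbook formula, not necessarily the model used in the paper, and all numbers below are hypothetical:

```python
import math

def young_interval(mtbe_hours: float, checkpoint_cost_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt ~= sqrt(2 * C * MTBE), with C the time to write
    one checkpoint and MTBE the mean time between errors."""
    return math.sqrt(2 * checkpoint_cost_hours * mtbe_hours)

# Hypothetical illustration (not values from the study): a 10-minute
# checkpoint cost and a 24-hour job-level MTBE.
print(round(young_interval(mtbe_hours=24.0, checkpoint_cost_hours=10 / 60), 2))
# -> 2.83 (checkpoint roughly every 2.8 hours)
```

The formula captures the intuition in the abstract: as MTBE shrinks (errors become more frequent), the optimal interval shrinks with its square root, so per-system MTBE measurements translate directly into per-system checkpointing policies.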