The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

📅 2026-05-05
📈 Citations: 0
Influential: 0
📝 Abstract
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection campaign on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN, +INF, and -INF values together account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that the addresses of corrupted data exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of production-class GPU architectures.
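To make the corruption-type and bit-flip categories above concrete, the following is a minimal sketch of how one corrupted output value might be classified against its fault-free ("golden") value. It is not the authors' tooling: the function names `f32_bits` and `classify_sdc` are hypothetical, and the sketch assumes outputs are IEEE-754 single-precision floats, which the abstract does not specify.

```python
import math
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify_sdc(golden: float, observed: float) -> dict:
    """Classify one corrupted value (hypothetical helper, float32 assumed).

    kind         -- "NaN", "+INF", "-INF", or "finite"
    flipped_bits -- number of bit positions that differ from the golden value
    single_bit   -- True when exactly one bit differs
    """
    # XOR leaves a 1 in every bit position where the two patterns disagree.
    diff = f32_bits(golden) ^ f32_bits(observed)
    flipped = bin(diff).count("1")

    if math.isnan(observed):
        kind = "NaN"
    elif math.isinf(observed):
        kind = "+INF" if observed > 0 else "-INF"
    else:
        kind = "finite"

    return {"kind": kind, "flipped_bits": flipped, "single_bit": flipped == 1}


if __name__ == "__main__":
    # A sign flip is a single-bit event; 1.0 -> 2.0 changes eight exponent bits.
    print(classify_sdc(1.0, -1.0))
    print(classify_sdc(1.0, 2.0))
```

Aggregating these records over many injected faults would yield the kind of distribution the abstract reports, such as the share of NaN and INF outcomes or the fraction of single-bit events.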
Problem

Research questions and friction points this paper is trying to address.

Silent Data Corruption
GPU reliability
fault modeling
large-scale GPU clusters
SDC characterization
Innovation

Methods, ideas, or system contributions that make the work stand out.

silent data corruption
GPU fault modeling
stuck-at fault injection
bit-flip characterization
warp-aligned correlation