Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing

📅 2025-08-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Defective compute chips are escaping existing manufacturing tests at a rate at least an order of magnitude above industrial targets, across all compute chip types in data centers. The silent data corruptions (SDCs) caused by these test escapes pose a major threat to reliable computing when left unaddressed. The paper outlines a three-pronged set of future directions for overcoming test escapes: (1) quick diagnosis of defective chips directly from incorrect system-level behaviors, which is needed to understand why so many defective chips escape manufacturing testing; (2) in-field detection of defective chips; and (3) new test experiments that evaluate the effectiveness of new detection techniques while avoiding the drawbacks and pitfalls of previous industrial test experiments and case studies.

📝 Abstract
Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach to future directions in overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.
Problem

Research questions and friction points this paper is trying to address.

Detect silent data corruption from test escapes
Improve diagnosis of defective chips in systems
Develop new tests to identify defective chips
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quick diagnosis of defective chips from incorrect system-level behaviors
In-field detection of defective chips
New test experiments to evaluate the effectiveness of new detection techniques
Subhasish Mitra
William E. Ayer Endowed Chair Professor, Stanford University
Computer Science, Computer Engineering, Electrical Engineering
Subho Banerjee
Google
Martin Dixon
Google
Rama Govindaraju
NVIDIA
Peter Hochschild
Google
Eric X. Liu
Google
Bharath Parthasarathy
Google
Parthasarathy Ranganathan
Google
systems, computer architecture, datacenters, energy efficiency, power management