Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing

📅 2025-08-03

🤖 AI Summary

Defective compute chips are escaping manufacturing tests at rates at least an order of magnitude above industrial targets, across all compute chip types in data centers. Left unaddressed, the silent data corruptions (SDCs) caused by these test escapes pose a major threat to reliable computing. The paper outlines three future directions for overcoming test escapes: (1) quick diagnosis of defective chips directly from incorrect system-level behavior, which is needed to understand why so many defective chips escape existing manufacturing tests; (2) in-field detection of defective chips; and (3) new test experiments that evaluate the effectiveness of new detection techniques while avoiding the drawbacks and pitfalls of previous industrial test experiments and case studies.

📝 Abstract

Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach to future directions in overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.

Problem

Research questions and friction points this paper is trying to address.

Detect silent data corruption from test escapes

Improve diagnosis of defective chips in systems

Design new test experiments that show how effective new detection techniques are

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quick diagnosis of defective chips from incorrect system-level behaviors

In-field detection of defective chips

New test experiments that evaluate the effectiveness of new detection techniques
