🤖 AI Summary
Many defective compute chips evade manufacturing tests, exceeding industrial defect-rate targets by at least an order of magnitude across all compute chip types in data centers. These test escapes cause silent data corruptions (SDCs) that threaten reliable computing. The paper outlines a three-pronged set of future directions for overcoming test escapes: (1) quick diagnosis of defective chips directly from system-level incorrect behaviors, which is needed to understand why so many defective chips escape existing tests; (2) in-field detection of defective chips; and (3) new test experiments that measure how effective new detection techniques are while avoiding the drawbacks and pitfalls of previous industrial test experiments and case studies.
📝 Abstract
Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach to future directions in overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.