ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

📅 2026-05-15
🤖 AI Summary
Existing functional tests struggle to detect silent data corruptions (SDCs) caused by silicon manufacturing defects. This work proposes ITHICA, an approach that exploits inconsistent outputs: the same instruction, executed twice in the same thread with the same inputs, can produce different architectural outputs depending on the execution context. By automatically inserting instruction duplication and intra-thread output comparison checks, ITHICA turns arbitrary programs into functional tests without requiring hardware modifications, and it identifies the affected instructions when an error is detected. The authors apply it to industrial hyperscaler test programs (the baseline), datacenter workloads, and common libraries. In an evaluation on over 3,000 CPU servers, ITHICA's error checks detect 39% more defective servers than the native checks within the ITHICA tests derived from the baseline programs, and the results yield novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
📝 Abstract
Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs. We present ITHICA, an approach for automatically generating functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies affected instructions upon error detections. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
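The duplicate-and-compare idea described in the abstract can be illustrated with a toy model. The sketch below is not ITHICA's implementation, which operates on real machine instructions. It is a minimal Python simulation in which every name (`run_with_checks`, `healthy_execute`, `make_defective_executor`, `InconsistencyDetected`) is hypothetical, and the "defect" is modeled, as an assumption, by a bit flip that depends on how many operations ran before it.

```python
import operator
from typing import Callable, List, Tuple

# Hypothetical toy model: an "instruction" is (name, operation, operands).
Instr = Tuple[str, Callable[[int, int], int], Tuple[int, int]]
Executor = Callable[[str, Callable[[int, int], int], Tuple[int, int]], int]


class InconsistencyDetected(Exception):
    """Raised when two executions of the same instruction disagree."""


def run_with_checks(program: List[Instr], execute: Executor) -> List[int]:
    """Execute every instruction twice on the same inputs and compare outputs."""
    results = []
    for index, (name, op, operands) in enumerate(program):
        first = execute(name, op, operands)
        second = execute(name, op, operands)  # duplicate with identical inputs
        if first != second:
            # The mismatch also pinpoints which instruction was affected.
            raise InconsistencyDetected(
                f"{name} at index {index}: {first} != {second}"
            )
        results.append(first)
    return results


def healthy_execute(name, op, operands):
    """Model of a defect-free unit: always returns the correct result."""
    return op(*operands)


def make_defective_executor(faulty_name: str, period: int) -> Executor:
    """Model of a context-dependent defect (an assumption for illustration):
    the low bit of `faulty_name` results flips on every `period`-th call."""
    calls = {"n": 0}

    def execute(name, op, operands):
        calls["n"] += 1
        value = op(*operands)
        if name == faulty_name and calls["n"] % period == 0:
            value ^= 1  # corrupt the low bit
        return value

    return execute


program = [
    ("add", operator.add, (2, 3)),
    ("mul", operator.mul, (4, 5)),
    ("add", operator.add, (1, 1)),
]
```

On the healthy model, `run_with_checks(program, healthy_execute)` returns the ordinary results. With `make_defective_executor("mul", 3)`, the first `mul` execution is corrupted and its duplicate is not, so the comparison raises `InconsistencyDetected`. A defect that corrupted both executions identically would pass this check, which is why the approach targets inconsistent errors.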
Problem

Research questions and friction points this paper is trying to address.

silent data corruption
silicon defect
instruction-level error
inconsistent execution
CPU reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

silent data corruption
instruction duplication
intra-thread checking
defect-induced errors
functional testing