🤖 AI Summary
This work addresses row-level fault diagnosis in systolic arrays, where existing methods localize faults to a column efficiently but need hardware redundancy to reach processing element (PE)-level precision. The authors propose a lightweight algebraic testing scheme that builds test vectors from pairwise coprime integers, so that the divisibility properties of a fault-induced output deviation identify the faulty row. The approach localizes faults at PE level without hardware redundancy, covers a general bounded error model, and achieves exact localization either with a second test pass or, for single-bit errors, with odd coprime inputs in a single pass. Under INT16 arithmetic, a single test pass covers arrays up to 256×256 with localization probability above 0.98, at a test cost under 1% of one inference GEMM tile.
📝 Abstract
Systolic arrays are the dominant compute fabric for neural network inference. Prior work has addressed column-level fault detection efficiently with uniform test patterns, but row-level (PE-level) fault localization within a faulty column remains open without resorting to hardware redundancy. The fundamental obstacle is that uniform test inputs destroy per-row signatures: any test that activates every row equally cannot distinguish which row is the source of an observed deviation.
In this paper, we propose a lightweight, purely algorithmic remedy based on coprime test vectors. When the test-input entries are pairwise coprime integers, a permanent weight-register fault produces a deviation whose divisibility signature identifies the faulty row. Under a general bounded error model, a single test pass localizes the faulty row with high probability. This error model covers a broader class of faults than prior dataflow-aware testing work has emphasized. When one pass is insufficient, a second pass using a ratio computation achieves exact localization; for the special case of single-bit errors, odd coprime entries guarantee exact localization in a single pass.
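The divisibility argument can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes exact integer accumulation (no overflow), a single faulty weight register per column, the first primes as the pairwise coprime entries, and an all-ones vector as the second-pass input for the ratio step. All function names are hypothetical.

```python
def is_prime(c):
    """Trial-division primality check; adequate for small test entries."""
    if c < 2:
        return False
    i = 2
    while i * i <= c:
        if c % i == 0:
            return False
        i += 1
    return True


def coprime_vector(n, odd_only=False):
    """First n primes (pairwise coprime); skip 2 when odd_only is set."""
    out, c = [], 3 if odd_only else 2
    while len(out) < n:
        if is_prime(c):
            out.append(c)
        c += 1
    return out


def column_output(x, w):
    """Exact accumulation of one array column: sum_i x[i] * w[i]."""
    return sum(xi * wi for xi, wi in zip(x, w))


def candidate_rows(x, deviation):
    """Rows whose test entry divides the observed nonzero deviation."""
    return [i for i, xi in enumerate(x) if deviation % xi == 0]


def locate_by_ratio(x, dev_coprime, dev_uniform):
    """Second pass (assumed all-ones input): the ratio of deviations is x[row]."""
    return x.index(dev_coprime // dev_uniform)


def locate_single_bit(x_odd, deviation):
    """Single-bit error: deviation = +/- 2**k * x_odd[row]; strip the power of two."""
    d = abs(deviation)
    while d and d % 2 == 0:
        d //= 2
    return x_odd.index(d)


# Demo: weight error e = 3 in row 4 of an 8-row column.
x = coprime_vector(8)                       # [2, 3, 5, 7, 11, 13, 17, 19]
golden = [1] * 8
faulty = list(golden)
faulty[4] += 3
dev = column_output(x, faulty) - column_output(x, golden)        # 11 * 3 = 33
print(candidate_rows(x, dev))               # [1, 4]: ambiguous, since 3 divides e

ones = [1] * 8
dev_u = column_output(ones, faulty) - column_output(ones, golden)  # e = 3
print(locate_by_ratio(x, dev, dev_u))       # 4
```

In this sketch the single pass is ambiguous only when the error itself is divisible by another row's entry, which is why a bounded error model leaves a small failure probability. With odd entries and an error of the form ±2^k, the odd part of the deviation is exactly the faulty row's entry, so `locate_single_bit` needs no second pass.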
For INT16 arithmetic, a single test pass covers array sizes up to $256{\times}256$ with localization probability above $0.98$, at a test cost under $1\%$ of one inference GEMM tile.
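As a rough feasibility check on the INT16 setting, the sketch below verifies that 256 pairwise coprime test entries fit in a signed 16-bit input. It assumes the entries are simply the first 256 primes and that weights are also INT16; the paper's actual entry selection and its probability bound are not reproduced here.

```python
from math import gcd

INT16_MAX = 2**15 - 1  # largest signed 16-bit value


def first_primes(n):
    """First n primes by trial division against the primes found so far."""
    primes, c = [], 2
    while len(primes) < n:
        if all(c % p for p in primes if p * p <= c):
            primes.append(c)
        c += 1
    return primes


entries = first_primes(256)                 # one test entry per array row
fits_int16 = max(entries) <= INT16_MAX      # every entry representable in INT16
pairwise_coprime = all(
    gcd(a, b) == 1
    for i, a in enumerate(entries)
    for b in entries[i + 1:]
)

# Worst-case column magnitude if every weight were at the INT16 extreme
# (assumption); +1 accounts for the sign bit of the accumulator.
acc_bits = (sum(entries) * 2**15).bit_length() + 1

print(fits_int16, pairwise_coprime, acc_bits)
```

The accumulator width printed here is only an upper bound under the stated assumption, intended to show that the test column sum stays within a conventional wide accumulator.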