An ECC-based Fault Tolerance Approach for DNNs

📅 2025-08-17
🤖 AI Summary
To address functional failures in safety-critical deep neural networks (DNNs), such as those used in autonomous driving, caused by bit-flips in weights stored in memory, this paper proposes SPW, a fault-tolerance approach that combines error-correcting codes (ECC) with weight masking. SPW corrects single-bit errors and masks (sets to zero) any weight in which a multi-bit error is detected, covering the case that a conventional single-error-correcting ECC cannot repair. At a bit error rate of 10⁻¹, SPW improves model accuracy by more than 300% relative to the ECC-only baseline, at the cost of 47.5% area overhead. The approach is evaluated with a statistical fault injection campaign.

📝 Abstract
Deep Neural Networks (DNNs) have achieved great success in solving a wide range of machine learning problems. Recently, they have been deployed in datacenters (potentially for business-critical or industrial applications) and in safety-critical systems such as self-driving cars. Their correct functionality in the presence of bit-flip errors in DNN parameters stored in memory is therefore key to their applicability in safety-critical applications. In this paper, a fault tolerance approach based on Error Correcting Codes (ECC), called SPW, is proposed to ensure the correct functionality of DNNs in the presence of bit-flip faults. In the proposed approach, an error is detected using the stored ECC; it is then corrected if it is a single-bit error, and otherwise the weight is set entirely to zero (i.e., masked). A statistical fault injection campaign is proposed and used to investigate the efficacy of the approach. The experimental results show that, at a Bit Error Rate of 10^(-1), the accuracy of the DNN increases by more than 300% compared to the case where the ECC technique alone is applied, at the expense of just 47.5% area overhead.
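The correct-or-mask rule in the abstract can be sketched in a few lines of Python. The abstract does not state the weight width or the specific code, so this sketch assumes an 8-bit weight protected by an extended Hamming (13,8) SEC-DED code; the names `encode` and `decode_or_mask` and the bit layout are hypothetical and are not taken from the paper.

```python
# Assumed layout (not from the paper): index 0 holds an overall parity bit,
# indices 1..12 form a Hamming(12,8) codeword with parity bits at 1, 2, 4, 8.
DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]
PARITY_POS = [1, 2, 4, 8]


def encode(weight):
    """Encode an 8-bit weight (0..255) as a list of 13 code bits."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POS):
        code[pos] = (weight >> i) & 1
    for p in PARITY_POS:
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[j] for j in range(1, 13) if j & p and j != p) % 2
    code[0] = sum(code[1:]) % 2  # make the total number of ones even
    return code


def decode_or_mask(code):
    """Return the stored weight, corrected if possible, or 0 (masked)."""
    syndrome = 0
    for j in range(1, 13):
        if code[j]:
            syndrome ^= j  # XOR of the indices of all set bits
    parity_violated = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_violated:
        pass  # no error detected
    elif parity_violated:
        # Odd number of flips: treated as a single-bit error.
        if syndrome > 12:
            return 0  # syndrome points outside the codeword: mask
        fixed[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
    else:
        return 0  # even number of flips detected: mask the weight
    return sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
```

Under these assumptions, any single flipped bit is repaired and any double flip yields a masked weight of 0. Three or more flips can still be miscorrected by a code of this strength, which is a property of the assumed SEC-DED code rather than a statement about the paper's design.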
Problem

Research questions and friction points this paper is trying to address.

Ensuring correct DNN functionality under bit-flip errors in stored weights
Detecting and correcting memory faults using ECC techniques
Improving fault tolerance with minimal area overhead
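A short calculation suggests why single-bit correction alone falls short at high bit error rates. Assuming bit flips are independent with probability p per bit, the chance that an n-bit stored codeword contains two or more flips is given below. The value n = 13 (an 8-bit weight plus SEC-DED check bits) is an illustrative assumption and is not taken from the paper.

$$
P(\geq 2\ \text{flips}) = 1 - (1-p)^{n} - n\,p\,(1-p)^{n-1}
$$

For p = 10^(-1) and n = 13:

$$
P(\geq 2\ \text{flips}) \approx 1 - 0.254 - 0.367 \approx 0.379
$$

Under these assumptions, roughly 38% of stored words would carry an error that single-bit correction cannot repair.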
Innovation

Methods, ideas, or system contributions that make the work stand out.

ECC-based fault tolerance for DNNs
Error detection and correction with ECC
Statistical fault injection for validation
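A fault injection step of the kind listed above might look like the following sketch. It assumes the simplest fault model, in which every stored bit flips independently with probability equal to the bit error rate; the abstract does not describe the campaign's sampling procedure, so the function name `inject_bit_flips` and its parameters are hypothetical.

```python
import random


def inject_bit_flips(words, n_bits, ber, rng):
    """Return a copy of `words` with each of the low `n_bits` bits of every
    word flipped independently with probability `ber` (assumed fault model)."""
    faulty = []
    for w in words:
        mask = 0
        for b in range(n_bits):
            if rng.random() < ber:
                mask |= 1 << b  # mark bit b for flipping
        faulty.append(w ^ mask)
    return faulty


# Usage: corrupt three 8-bit weights at a bit error rate of 0.1.
rng = random.Random(0)  # fixed seed so a run can be repeated
corrupted = inject_bit_flips([181, 0, 255], n_bits=8, ber=0.1, rng=rng)
```

In a full campaign, a step like this would be repeated over many random trials, with model accuracy measured after each one.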
Mohsen Raji
Associate Professor, School of Electrical & Computer Engineering, Shiraz University
Efficient AI, Deep Learning, Reliability, IoT, FPGA
Mohammad Zaree
School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
Kimia Soroush
School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran