🤖 AI Summary
Conventional fault-tolerance mechanisms, such as ECC and hardware redundancy, are inadequate against the permanent hardware faults that high-energy radiation and adversarial attacks induce in spaceborne processors.
Method: This paper proposes a detection–response fault-tolerant architecture integrating physical-layer sensing, instruction-level software-adaptive reconfiguration, and FPGA-based partial reconfiguration at the hardware level. It introduces an on-chip delay sensor for pre-fault radiation sensing and enables, for the first time, dynamic instruction remapping and resource-level self-healing reconfiguration under permanent faults.
Contribution/Results: Implemented on a 28 nm FPGA-based RISC-V processor, the architecture detects and recovers from diverse soft and hard errors while ensuring continuous system operation. Experimental evaluation demonstrates sub-100 ns fault detection and response latency and hardware reconfiguration recovery time under 5 ms.
📝 Abstract
Satellites are highly vulnerable to adversarial glitches and high-energy radiation in space, either of which can cause faults in the onboard computer. Over the last decades, various radiation- and fault-tolerant methods, such as error correction codes (ECC) and redundancy-based approaches, have been explored to mitigate temporary soft errors in software and hardware. However, conventional ECC methods cannot handle hard errors or permanent faults in hardware components. This work introduces a detection- and response-based countermeasure for partially damaged processor chips. It recovers the processor from permanent faults and enables continuous operation using the undamaged resources that remain on the chip. We incorporate digitally compatible delay-based sensors on the target processor's chip to reliably detect incoming radiation or glitching attempts on the physical fabric of the chip, even before a fault occurs. Upon detecting a fault in one or more components of the processor's arithmetic logic unit (ALU), our countermeasure employs adaptive software recompilation to resynthesize the affected instructions, substituting them with instructions that run on still-functioning components. Furthermore, if the fault is more widespread and prevents the correct operation of the entire processor, our approach deploys adaptive hardware partial reconfiguration to replace the failed components and reroute them to undamaged locations on the chip. To validate our claims, we direct a high-energy near-infrared (NIR) laser beam at a RISC-V processor implemented on a 28 nm FPGA to emulate radiation and even hard errors by partially damaging the FPGA fabric. We demonstrate that our sensor can confidently detect the radiation and trigger the processor testing and fault recovery mechanisms. Finally, we discuss the overhead imposed by our countermeasure.
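To make the instruction-substitution idea concrete, the following is a minimal sketch, not the authors' recompilation flow. It assumes a 32-bit register width and a hypothetical set of faulty ALU operations reported by a self-test. When an operation is marked faulty, the result is computed from operations assumed to still work (add, shift, bitwise logic). All names here (`execute`, `SUBSTITUTES`, `mul_via_shift_add`, `sub_via_add`) are illustrative assumptions.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit register width


def mul_via_shift_add(a, b):
    """Multiply using only shift, AND, and add (no multiplier unit)."""
    a &= MASK
    b &= MASK
    result = 0
    while b:
        if b & 1:
            result = (result + a) & MASK
        a = (a << 1) & MASK
        b >>= 1
    return result


def sub_via_add(a, b):
    """Subtract as a + (~b + 1), i.e. two's-complement negation plus add."""
    return (a + ((~b & MASK) + 1)) & MASK


# Native behaviour of each operation when its hardware unit is healthy.
NATIVE = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "mul": lambda a, b: (a * b) & MASK,
}

# Hypothetical substitution table: faulty op -> equivalent built from other ops.
SUBSTITUTES = {
    "mul": mul_via_shift_add,
    "sub": sub_via_add,
}


def execute(op, a, b, faulty):
    """Run `op`, falling back to a substitute if its unit is marked faulty."""
    if op in faulty:
        if op not in SUBSTITUTES:
            # No software workaround: this is where a hardware-level
            # response (e.g. partial reconfiguration) would be needed.
            raise RuntimeError(f"no software substitute for faulty op {op!r}")
        return SUBSTITUTES[op](a, b)
    return NATIVE[op](a, b)
```

For example, `execute("mul", 7, 6, faulty={"mul"})` takes the shift-and-add path and yields the same 32-bit result as the native multiply. An operation with no entry in the table, such as a faulty `add` in this sketch, raises an error, which loosely mirrors the case where the abstract escalates from software recompilation to hardware partial reconfiguration.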