🤖 AI Summary
This work addresses a limitation of conventional system-on-chip (SoC) designs, which apply single-event upset (SEU) mitigation to individual components in isolation and often leave interconnects and voting logic unprotected, creating single points of failure. The authors propose combining component-specific architectural protections for processor cores, memory, interconnects, and voting logic, and enforcing an overlap between them so that SEU hardening is end-to-end and free of gaps. The approach is demonstrated on a RISC-V microcontroller SoC: simulation-based fault injection shows tolerance to over 99.9% of faults in both the RTL and the implemented netlist, and full physical implementation is used to assess overheads. The design exhibits 22% lower implementation overhead than a single global fault-tolerance method such as fine-grained triplication.
📝 Abstract
Single-event upset (SEU) fault tolerance for systems-on-chip (SoCs) in radiation-heavy environments is often addressed by architectural fault-tolerance approaches that protect individual SoC components (e.g., cores, memories) in isolation. However, protecting the voting logic and the interconnections among components is also critical, as these otherwise become single points of failure in the design. We investigate combining multiple fault-tolerance approaches targeting individual SoC components, including interconnect and voting logic, to ensure end-to-end SoC-level architectural SEU fault tolerance while minimizing implementation area overhead. Enforcing an overlap between the protection methods ensures hardening of the whole design without gaps, while curtailing overheads. We demonstrate our approach on a RISC-V microcontroller SoC. SEU fault tolerance is assessed with simulation-based fault injection, and overheads are assessed with full physical implementation. Tolerance to over 99.9% of faults in both the RTL and the implemented netlist is demonstrated. Furthermore, the design exhibits 22% lower implementation overhead compared to a single global fault-tolerance method, such as fine-grained triplication.
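The point about voting logic becoming a single point of failure can be illustrated with a small behavioral model. The sketch below is not the authors' design: it is a generic textbook triple-modular-redundancy (TMR) example, and every name in it (`majority`, `flip_bit`, `single_voter`, `triplicated_voters`) is hypothetical. It assumes an SEU can be modeled as one flipped bit in one word.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three replica words."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model an SEU as a single flipped bit (an assumption of this sketch)."""
    return word ^ (1 << bit)


def single_voter(replicas, voter_fault_bit=None):
    """Three replicas feed one voter; an upset at the voter output is unmasked."""
    out = majority(*replicas)
    if voter_fault_bit is not None:
        out = flip_bit(out, voter_fault_bit)
    return out


def triplicated_voters(replicas, faulty_voter=None, voter_fault_bit=0):
    """Three replicas feed three voters; each voter drives one downstream replica.

    An upset in one voter corrupts at most one of the three outputs, so the
    next voting stage can still mask it.
    """
    outs = [majority(*replicas) for _ in range(3)]
    if faulty_voter is not None:
        outs[faulty_voter] = flip_bit(outs[faulty_voter], voter_fault_bit)
    return outs


golden = 0xDEADBEEF

# An upset in one replica is masked by the vote.
masked = majority(golden, flip_bit(golden, 5), golden)

# An upset in the lone voter propagates: the voter is a single point of failure.
unprotected = single_voter([golden] * 3, voter_fault_bit=3)

# With triplicated voters, only one of three outputs is wrong, and the
# following voting stage recovers the correct value.
outs = triplicated_voters([golden] * 3, faulty_voter=1, voter_fault_bit=3)
recovered = majority(*outs)
```

In this model, `masked` and `recovered` equal `golden`, while `unprotected` does not. Triplicating every voter in this way is the fine-grained style of protection the abstract uses as its overhead baseline; the abstract's claim is that overlapping component-specific methods covers the same gaps at lower cost.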