🤖 AI Summary
To meet the demand for high-throughput, low-latency, and highly reliable on-board intelligent processing under space radiation, this work tackles task interruptions caused by single-event upsets (SEUs) in commercial heterogeneous accelerators, specifically Zynq FPGAs and Myriad VPUs. It proposes an end-to-end collaborative fault-tolerant architecture that integrates multi-level heterogeneous redundancy. On the FPGA side, it combines dynamic memory scrubbing, partial reconfiguration, and triple modular redundancy (TMR). On the VPU side, it combines SHAVE-core-level redundancy, ECC-protected instruction and data memories, and a custom CRC-enhanced CIF/LCD interface. A collaborative watchdog mechanism and extended communication protocols support cross-chip consistency. Targeting device families used in on-board platforms such as CogniSat and Q7S, the architecture aims to reduce SEU-induced task interruptions and to enable robust, efficient, radiation-tolerant on-board intelligent processing.
📝 Abstract
The ever-increasing demand for computational power and I/O throughput in space applications is transforming the landscape of on-board computing. A variety of Commercial-Off-The-Shelf (COTS) accelerators are emerging as an attractive solution for payload processing, outperforming traditional radiation-hardened devices. To increase the reliability of such COTS accelerators, this paper explores and evaluates fault-tolerance techniques for the Zynq FPGA and the Myriad VPU, two device families being integrated in industrial space avionics architectures and boards, such as Ubotica’s CogniSat, Xiphos’ Q7S, and Cobham Gaisler’s GR-VPX-XCKU060. On the FPGA side, we combine techniques such as memory scrubbing, partial reconfiguration, triple modular redundancy, and watchdogs. On the VPU side, we detect and correct errors in the instruction and data memories, and we apply redundancy at the processor level (SHAVE cores). For FPGA-VPU co-processing, we also develop a fault-tolerant interface between the two devices based on the CIF/LCD protocols and our custom CRC error-detecting code.
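As a rough illustration of two of the techniques named above, the following Python sketch shows a bitwise majority voter, which is the core operation of triple modular redundancy, and a CRC-checked frame of the kind a fault-tolerant link might exchange. It is a sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the real designs run in FPGA logic and VPU firmware rather than Python, and the standard CRC-32 from `zlib` stands in for the authors' custom CRC, whose polynomial and framing are not given here.

```python
import zlib
from typing import Optional


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a word.

    Each output bit takes the value held by at least two of the three
    inputs, so an upset confined to one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


def frame_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload.

    Assumption: CRC-32 is only a stand-in for the custom CRC in the text.
    """
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the trailing CRC matches, otherwise None."""
    if len(frame) < 4:
        return None  # too short to carry a CRC field
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


if __name__ == "__main__":
    # One of the three copies has a flipped bit; the voter masks it.
    print(bin(tmr_vote(0b1010, 0b1010, 0b1110)))

    frame = frame_with_crc(b"pixel row")
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # single-bit upset
    print(check_frame(frame), check_frame(corrupted))
```

The two mechanisms behave differently. The voter masks a fault in one copy without any retry. The CRC only detects corruption, so a receiver that gets `None` would need some recovery step, such as requesting retransmission, which this sketch leaves out.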