RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

📅 2025-04-19
🤖 AI Summary
To address fault tolerance in safety-critical parallel floating-point accelerators, this paper proposes RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication engine. Replication-based approaches carry high area and power costs, and error correction codes lack the flexibility to trade robustness against performance, so the design combines replication with error-detecting codes on the data path and, in its full-protection configuration, extends protection to control signals. The fault tolerance mode is configured in a shadowed context register file before task execution. Data-path protection alone achieves an 11x reduction in uncorrected faults at 2.3% area overhead. Full protection yields no functional errors after 1M injections in a fault-injection simulation campaign, at a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.

📝 Abstract
As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator, balancing fault tolerance, area overhead, and performance impacts. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11x uncorrected fault reduction with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.
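
The general idea of pairing replication with error-detecting codes can be illustrated in software. The sketch below is not the RedMulE-FT hardware design: it is a minimal Python model under stated assumptions, where integers stand in for floating-point words, a single even-parity bit stands in for the error-detecting code, and a mismatch between two replicated executions triggers re-execution. All names (`Mode`, `run_task`, `flip`, `injected`) are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Mode(Enum):
    PERFORMANCE = "performance"        # single execution, no checks
    FAULT_TOLERANT = "fault_tolerant"  # parity-checked inputs, duplicated execution


def parity(value: int) -> int:
    """Even-parity bit over the 16 low bits of an integer word."""
    return bin(value & 0xFFFF).count("1") % 2


def encode(matrix):
    """Attach a parity bit to every element: (value, parity)."""
    return [[(v, parity(v)) for v in row] for row in matrix]


def decode(coded, check=True):
    """Strip parity bits; with check=True, raise if any word fails its parity."""
    for row in coded:
        for value, bit in row:
            if check and parity(value) != bit:
                raise ValueError("parity mismatch on input word")
    return [[value for value, _ in row] for row in coded]


def matmul(x, w, flip=None):
    """Integer matrix multiply; flip=(i, j, mask) XORs one output word
    to model a transient fault (hypothetical injection hook)."""
    z = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]
    if flip is not None:
        i, j, mask = flip
        z[i][j] ^= mask
    return z


def run_task(x_coded, w_coded, mode, injected=(), max_retries=3):
    """Return (result, attempts). `injected` holds one fault (or None) per attempt."""
    check = mode is Mode.FAULT_TOLERANT
    x, w = decode(x_coded, check), decode(w_coded, check)
    faults = list(injected)
    for attempt in range(1, max_retries + 2):
        flip = faults.pop(0) if faults else None
        primary = matmul(x, w, flip)
        if mode is Mode.PERFORMANCE:
            return primary, attempt      # fault, if any, goes unnoticed
        replica = matmul(x, w)           # redundant copy, assumed fault-free here
        if primary == replica:
            return primary, attempt
        # Mismatch detected: loop again to re-execute the task.
    raise RuntimeError("fault persisted after retries")
```

In this model, a fault injected into the first attempt corrupts the result silently in `Mode.PERFORMANCE`, while `Mode.FAULT_TOLERANT` detects the mismatch and returns the correct product on the second attempt. The cost is the second execution, which mirrors the robustness-versus-performance trade-off the abstract describes.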
Problem

Research questions and friction points this paper is trying to address.

Flexible fault tolerance for parallel floating-point accelerators
Balancing fault tolerance, area overhead, and performance impacts
Reducing uncorrected faults with minimal area overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconfigurable fault-tolerant matrix multiplication engine
Combines replication with error-detecting codes
Configurable fault tolerance with minimal area overhead
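
One way to picture mode configuration through a shadowed context register file is a two-copy register model: software writes the next task's settings into a shadow copy, and the settings take effect only when the task is started. The sketch below assumes that commit-at-task-start behavior, which is a common use of shadow registers and not a detail confirmed by the abstract. The class and field names (`ShadowedContext`, `ft_mode`) are hypothetical.

```python
class ShadowedContext:
    """Toy register file with a writable shadow copy and a read-only active copy."""

    def __init__(self, **defaults):
        self._shadow = dict(defaults)   # written by software at any time
        self._active = dict(defaults)   # read by the running task

    def write(self, name, value):
        """Stage a value for the next task; the running task is unaffected."""
        if name not in self._shadow:
            raise KeyError(name)
        self._shadow[name] = value

    def commit(self):
        """Called at task start (assumed): copy staged values into the active set."""
        self._active = dict(self._shadow)

    def read_active(self, name):
        return self._active[name]


ctx = ShadowedContext(ft_mode="performance")
ctx.write("ft_mode", "fault_tolerant")   # staged only
before = ctx.read_active("ft_mode")      # still "performance"
ctx.commit()                             # next task starts
after = ctx.read_active("ft_mode")       # now "fault_tolerant"
```

Under this assumption, a task already in flight keeps the mode it started with, and the next task can be given a different protection level without stalling the accelerator.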