RedMulE-FT: A Reconfigurable Fault-Tolerant Matrix Multiplication Engine

📅 2025-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address fault tolerance in safety-critical parallel floating-point accelerators, this paper proposes RedMulE-FT, a runtime-configurable fault-tolerant matrix multiplication architecture built on RedMulE. Replication-based approaches carry high area and power overhead, and error correction codes (ECC) cannot trade robustness against performance, so the design combines replication with error-detecting codes on the data path and extends protection to control signals in its full-protection configuration. The fault-tolerance mode is set in a shadowed context register file before a task executes. Data-path protection yields an 11× reduction in uncorrected faults at 2.3% area overhead. Full protection produced no functional errors across one million fault injections in simulation, at a total area overhead of 25.2% while sustaining 500 MHz in a 12 nm technology.

📝 Abstract

As safety-critical applications increasingly rely on data-parallel floating-point computations, there is a growing need for flexible and configurable fault tolerance in parallel floating-point accelerators such as tensor engines. Replication-based methods ensure reliability but incur high area and power costs, while error correction codes lack the flexibility to trade off robustness against performance. This work presents RedMulE-FT, a runtime-configurable fault-tolerant extension of the RedMulE matrix multiplication accelerator that balances fault tolerance, area overhead, and performance impact. The fault tolerance mode is configured in a shadowed context register file before task execution. By combining replication with error-detecting codes to protect the data path, RedMulE-FT achieves an 11× reduction in uncorrected faults with only 2.3% area overhead. Full protection extends to control signals, resulting in no functional errors after 1M injections during our extensive fault injection simulation campaign, with a total area overhead of 25.2% while maintaining a 500 MHz frequency in a 12 nm technology.
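
The combination of replication and error-detecting codes described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the RedMulE-FT hardware: it uses integers instead of floating-point values, a single parity bit as the error-detecting code, a full duplicate computation as the replica, and re-execution on detection as the recovery step. All names (`FaultDetected`, `encode`, `check`, `run_task`, `run_with_retry`) are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when a check notices a corrupted value."""


def parity(word):
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def encode(matrix):
    """Attach a parity bit to every element (the assumed error-detecting code)."""
    return [[(value, parity(value)) for value in row] for row in matrix]


def check(encoded):
    """Verify parity on every element and return the plain values."""
    plain = []
    for row in encoded:
        for value, bit in row:
            if parity(value) != bit:
                raise FaultDetected("parity mismatch on operand")
        plain.append([value for value, _ in row])
    return plain


def matmul(x, w, flip=None):
    """Integer matrix product; flip=(i, j, bit) injects one bit flip into the result."""
    z = [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]
    if flip is not None:
        i, j, bit = flip
        z[i][j] ^= 1 << bit
    return z


def run_task(x_enc, w_enc, redundant, flip=None):
    """Run one task; in redundant mode, compute twice and compare the results."""
    x, w = check(x_enc), check(w_enc)
    primary = matmul(x, w, flip)  # the injected fault, if any, hits only this copy
    if redundant and primary != matmul(x, w):
        raise FaultDetected("replica mismatch")
    return primary


def run_with_retry(x_enc, w_enc, flips):
    """Re-execute on detection; flips lists the fault injected on each attempt."""
    for attempt, flip in enumerate(flips):
        try:
            return run_task(x_enc, w_enc, redundant=True, flip=flip), attempt
        except FaultDetected:
            continue
    raise FaultDetected("all attempts failed")


x = encode([[1, 2], [3, 4]])
w = encode([[5, 6], [7, 8]])
fault = (0, 0, 0)  # flip bit 0 of output element (0, 0)

print(run_task(x, w, redundant=False, flip=fault))  # [[18, 22], [43, 50]]: silent corruption
print(run_with_retry(x, w, [fault, None]))          # ([[19, 22], [43, 50]], 1): detected, then retried
```

In this model the unprotected mode returns a wrong result without any indication, while the redundant mode turns the same fault into a detected event that a retry can repair. The price is that every task is computed twice, which mirrors the robustness-versus-performance trade-off that a runtime-configurable mode is meant to expose.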

Problem

Research questions and friction points this paper is trying to address.

Flexible fault tolerance for parallel floating-point accelerators

Balancing fault tolerance, area overhead, and performance impacts

Reducing uncorrected faults with minimal area overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reconfigurable fault-tolerant matrix multiplication engine

Combines replication with error-detecting codes

Configurable fault tolerance with minimal area overhead
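
The idea of selecting the fault-tolerance mode per task through a shadowed context register file can be sketched as follows. This is a hypothetical software model, not the actual RedMulE-FT register map: the mode names, the `TaskContext` fields, and the `ShadowedRegFile` methods are all assumptions made for illustration. The point it shows is that the next task's configuration is written to a shadow copy while the current task keeps using the active copy, and the new mode only takes effect when the shadow is committed at a task boundary.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FTMode(Enum):
    """Hypothetical mode names; the real encoding is not given in the source."""
    PERFORMANCE = 0   # no redundancy, full throughput
    REDUNDANT = 1     # replicated computation with checking


@dataclass(frozen=True)
class TaskContext:
    """Assumed per-task configuration: matrix dimensions plus the mode."""
    m: int
    n: int
    k: int
    ft_mode: FTMode


class ShadowedRegFile:
    """Two contexts: 'active' drives the running task, 'shadow' holds the next one."""

    def __init__(self) -> None:
        self.active: Optional[TaskContext] = None
        self.shadow: Optional[TaskContext] = None

    def write_shadow(self, ctx: TaskContext) -> None:
        """Software programs the next task without disturbing the running one."""
        self.shadow = ctx

    def commit(self) -> TaskContext:
        """At a task boundary, the shadow context becomes the active context."""
        if self.shadow is None:
            raise RuntimeError("no context has been programmed")
        self.active, self.shadow = self.shadow, None
        return self.active


regs = ShadowedRegFile()
regs.write_shadow(TaskContext(m=8, n=8, k=8, ft_mode=FTMode.PERFORMANCE))
regs.commit()                                   # first task starts in performance mode
regs.write_shadow(TaskContext(m=8, n=8, k=8, ft_mode=FTMode.REDUNDANT))
print(regs.active.ft_mode)                      # still the running task's mode
print(regs.commit().ft_mode)                    # next task runs in redundant mode
```

Keeping the mode inside the task context means a safety-critical task and a throughput-oriented task can follow each other on the same engine without a mid-task mode change.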
