FTI-TMR: A Fault Tolerance and Isolation Algorithm for Interconnected Multicore Systems

📅 2025-10-19
🤖 AI Summary
To address insufficient fault tolerance in interconnected multicore systems under concurrent permanent and transient faults, this paper proposes an integrated fault-tolerant architecture that requires no additional hardware. It identifies reliable computing units dynamically via a stability metric and combines periodic online diagnosis with adaptive task scheduling to achieve fault detection, isolation, and load balancing. Key contributions are: (1) a lightweight stability metric that replaces conventional redundancy mechanisms; (2) precise isolation of permanent faults through dynamic diagnosis; and (3) overcoming the limitations of two-phase TMR and R-TMR in multicore concurrent-failure scenarios, all without hardware overhead. Experimental results show that, compared to baseline TMR, the approach reduces task load by ~30%, raises fault coverage by 22.5%, and improves isolation accuracy by 37.8%, significantly boosting system reliability and energy efficiency.

📝 Abstract
Two-Phase Triple Modular Redundancy (TMR) divides redundancy operations into two stages, omitting part of the computation during fault-free operation to reduce energy consumption. However, it becomes ineffective under permanent faults, which limits its reliability in critical systems. To address this, Reactive-TMR (R-TMR) introduces permanent fault isolation mechanisms for faulty cores, tolerating both transient and permanent faults. Yet its reliance on additional hardware increases system complexity and reduces fault tolerance when multiple cores or auxiliary modules fail. This paper proposes an integrated fault-tolerant architecture for interconnected multicore systems. By constructing a stability metric to identify reliable machines and performing periodic diagnostics, the method enables permanent fault isolation and adaptive task scheduling without extra hardware. Experimental results show that it reduces task workload by approximately 30% compared to baseline TMR and achieves superior fault coverage and isolation accuracy, significantly improving both reliability and energy efficiency.
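
The two-phase idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that "omitting part of the computation" means skipping the third replica whenever the first two agree, and the names `two_phase_tmr` and `make_task` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter
from typing import Callable, Sequence, Set, Tuple


def two_phase_tmr(task: Callable[[int], int], cores: Sequence[int]) -> Tuple[int, int]:
    """Run `task` on two cores; run the third only if they disagree.

    Returns (voted_result, number_of_executions).
    Assumption: results are hashable and compared with ==.
    """
    if len(cores) != 3:
        raise ValueError("two-phase TMR needs exactly three cores")

    # Phase 1: two redundant executions.
    first = task(cores[0])
    second = task(cores[1])
    if first == second:
        return first, 2  # fault-free path: third execution is skipped

    # Phase 2: the third execution breaks the tie by majority vote.
    third = task(cores[2])
    value, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three results differ")
    return value, 3


def make_task(faulty: Set[int]) -> Callable[[int], int]:
    """Hypothetical workload: faulty cores return -1, healthy cores 42."""
    def task(core: int) -> int:
        return -1 if core in faulty else 42
    return task


print(two_phase_tmr(make_task(set()), [0, 1, 2]))  # no faults
print(two_phase_tmr(make_task({1}), [0, 1, 2]))    # core 1 faulty
```

In this sketch a permanently faulty core among the first two forces the third execution on every task, so the energy saving disappears. That is one plausible reading of why the abstract calls two-phase TMR ineffective under permanent faults.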
Problem

Research questions and friction points this paper is trying to address.

Addressing permanent fault vulnerability in multicore redundancy systems
Reducing hardware complexity while maintaining fault tolerance coverage
Improving energy efficiency without compromising system reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-phase redundancy reduces energy by omitting part of the computation during fault-free operation
Stability metrics enable fault isolation without extra hardware
Adaptive task scheduling improves reliability and energy efficiency
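
The summary does not define the stability metric, so the following is only a sketch under stated assumptions of how a software-only metric could drive isolation and scheduling. The exponentially weighted score, the `alpha` and `threshold` parameters, and the `StabilityTracker` class are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CoreState:
    stability: float = 1.0  # hypothetical score in [0, 1]; 1.0 = fully trusted
    isolated: bool = False


class StabilityTracker:
    """Assumed scheme: score each core by how often it agrees with the vote."""

    def __init__(self, n_cores: int, alpha: float = 0.5, threshold: float = 0.3):
        self.cores: Dict[int, CoreState] = {i: CoreState() for i in range(n_cores)}
        self.alpha = alpha          # weight given to the newest observation
        self.threshold = threshold  # isolate cores scoring below this value

    def record(self, core: int, agreed: bool) -> None:
        """Update a core's score after a vote (exponential moving average)."""
        state = self.cores[core]
        sample = 1.0 if agreed else 0.0
        state.stability = (1 - self.alpha) * state.stability + self.alpha * sample

    def diagnose(self) -> List[int]:
        """Periodic check: isolate cores whose score fell below the threshold."""
        newly_isolated = []
        for core_id, state in self.cores.items():
            if not state.isolated and state.stability < self.threshold:
                state.isolated = True
                newly_isolated.append(core_id)
        return newly_isolated

    def pick(self, k: int = 3) -> List[int]:
        """Adaptive scheduling: choose the k most stable non-isolated cores."""
        healthy = [c for c, s in self.cores.items() if not s.isolated]
        if len(healthy) < k:
            raise RuntimeError("not enough healthy cores")
        healthy.sort(key=lambda c: (-self.cores[c].stability, c))
        return healthy[:k]


tracker = StabilityTracker(n_cores=4)
tracker.record(2, agreed=False)   # one disagreement, e.g. a transient fault
print(tracker.diagnose())         # core 2 is still above the threshold
tracker.record(2, agreed=False)   # repeated disagreement
print(tracker.diagnose(), tracker.pick())
```

With these assumed parameters a single disagreement lowers a core's score without isolating it, while repeated disagreements push it below the threshold. This is one simple way to separate transient from permanent faults in software.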