FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Transformer inference is highly susceptible to soft errors. Existing fault-tolerant approaches protect each operator separately with decoupled kernels, which imposes substantial computational and memory overhead. To address this, we propose End-to-End Fault-Tolerant Attention (EFTA), which performs error detection and correction jointly within a fully fused attention kernel. Our key contributions are: (1) architecture-aware Algorithm-Based Fault Tolerance (ABFT) using tensor checksums; (2) selective neuron value restriction to suppress error propagation; and (3) unified verification, which reuses checksums to reduce redundant memory accesses and fault exposure. Implemented with CUDA and Tensor Core optimizations, EFTA achieves up to 7.56× speedup over conventional methods with an average fault-tolerance overhead of 13.9%. It reports error coverage above 99.9% while preserving GPU inference efficiency.

📝 Abstract

Transformer models leverage self-attention mechanisms to capture complex dependencies, demonstrating exceptional performance in various applications. However, the long-duration, high-load computations required for model inference impose stringent reliability demands on the computing platform, as soft errors that occur during execution can significantly degrade model performance. Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead. In this paper, we propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention (EFTA) to improve inference reliability against soft errors. Our approach enables error detection and correction within a fully fused attention kernel, reducing redundant data access and thereby mitigating memory faults. To further enhance error coverage and reduce overhead, we design a hybrid fault tolerance scheme tailored to EFTA, introducing for the first time: 1) architecture-aware algorithm-based fault tolerance (ABFT) using tensor checksums, which minimizes inter-thread communication overhead on tensor cores during error detection; 2) selective neuron value restriction, which selectively applies adaptive fault tolerance constraints to neuron values, balancing error coverage and overhead; 3) unified verification, which reuses checksums to streamline multiple computation steps into a single verification process. Experimental results show that EFTA achieves up to 7.56x speedup over traditional methods with an average fault tolerance overhead of 13.9%.
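
The checksum idea behind ABFT can be illustrated with a minimal sketch of the classic row/column checksum scheme for a matrix product C = A·B. This is not the paper's tensor-checksum kernel, which targets Tensor Cores inside a fused attention kernel; the function names (`matmul`, `verify_and_correct`) and the tolerance `tol` are assumptions made for illustration only.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def verify_and_correct(a, b, c, tol=1e-6):
    """Check C against checksums derived from A and B (hypothetical helper).

    Column sums of C must equal (column sums of A) @ B, and row sums of C
    must equal A @ (row sums of B). A single corrupted element shows up as
    exactly one bad row and one bad column, which locates it; the checksum
    difference then restores the value in place.
    """
    col_sums_a = [sum(col) for col in zip(*a)]
    expected_cols = matmul([col_sums_a], b)[0]
    actual_cols = [sum(col) for col in zip(*c)]

    row_sums_b = [[sum(row)] for row in b]
    expected_rows = [r[0] for r in matmul(a, row_sums_b)]
    actual_rows = [sum(row) for row in c]

    bad_cols = [j for j, (e, x) in enumerate(zip(expected_cols, actual_cols))
                if abs(e - x) > tol]
    bad_rows = [i for i, (e, x) in enumerate(zip(expected_rows, actual_rows))
                if abs(e - x) > tol]

    if not bad_cols and not bad_rows:
        return "ok"
    if len(bad_cols) == 1 and len(bad_rows) == 1:
        i, j = bad_rows[0], bad_cols[0]
        c[i][j] += expected_cols[j] - actual_cols[j]
        return "corrected"
    return "uncorrectable"


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul(a, b)
c[1][0] += 100.0  # simulate a soft error in one output element
status = verify_and_correct(a, b, c)
```

In this sketch the checksums cost two extra matrix-vector products, which is small compared with the full product. A fused design such as the one the abstract describes would compute the checks alongside the main computation rather than in a separate pass.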

Problem

Research questions and friction points this paper is trying to address.

Enhancing Transformer reliability against soft errors during inference

Reducing computational and memory overhead in fault tolerance methods

Integrating end-to-end fault tolerant attention for error detection and correction

Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end fault tolerant attention (EFTA)

Hybrid fault tolerance with tensor checksum

Selective neuron value restriction technique
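
One simple form of value restriction can be sketched as a range check on softmax outputs, which must lie in [0, 1] by construction. The bounds, the NaN handling, and the helper name `restrict` are assumptions for illustration; the paper's selective scheme decides adaptively where to apply such constraints and is not reproduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def restrict(values, lo=0.0, hi=1.0):
    """Clamp values to [lo, hi] and report which indices were out of range.

    NaN is treated as a fault and replaced with lo (an assumed policy).
    """
    restricted, flagged = [], []
    for i, v in enumerate(values):
        if math.isnan(v):
            restricted.append(lo)
            flagged.append(i)
        elif v < lo or v > hi:
            restricted.append(min(max(v, lo), hi))
            flagged.append(i)
        else:
            restricted.append(v)
    return restricted, flagged


probs = softmax([1.0, 2.0, 3.0])
corrupted = list(probs)
corrupted[1] = 1e30  # simulate a bit flip in the exponent field
restricted, flagged = restrict(corrupted)
```

A range check like this only bounds the damage of a large deviation; it does not recover the original value, which is why the abstract pairs value restriction with checksum-based detection and correction.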
