Grid-like Error-Correcting Codes for Matrix Multiplication with Better Correcting Capability

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Matrix multiplication in large-scale distributed training is vulnerable to silent data corruption (SDC), which can propagate through training and degrade model performance. The paper proposes a grid-structured error-correcting coding framework for fault-tolerant matrix multiplication over the real field. Using a grid-based encoding structure, the scheme locates and deterministically corrects up to two erroneous symbols distributed across three matrices. The authors report 100% correction reliability for this fault model, at a cost of 24% additional computation time on GPUs, and provide a theoretical analysis of the code's error-correction properties.

📝 Abstract
Matrix multiplication over the real field constitutes a foundational operation in the training of deep learning models, serving as a computational cornerstone for both forward and backward propagation processes. However, the presence of silent data corruption (SDC) in large-scale distributed training environments poses a significant threat to model convergence and predictive accuracy, particularly when such errors manifest during matrix multiplication. Due to their transient and non-intrusive nature, these errors often evade detection, allowing them to propagate and accumulate over time, ultimately leading to substantial degradation in model performance. In this paper, we introduce a novel error-correcting coding framework specifically tailored for matrix multiplication operations. Our proposed framework is designed to detect and correct multiple computational errors that may arise during the execution of matrix products. By leveraging a grid-based structural encoding scheme, our approach enhances error localization and correction capabilities across all participating matrices, thereby significantly improving the fault tolerance of the computation. Experimental results demonstrate that our method achieves deterministic correction of up to two erroneous symbols distributed across three matrices with 100% reliability, while incurring only a 24% overhead in computational time on GPU architectures. Furthermore, we provide a rigorous theoretical analysis of the error-correction properties inherent to our coding scheme, establishing its correctness and robustness under well-defined fault models.
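
To make the idea of checksum-protected matrix multiplication concrete, the sketch below shows the classical row/column checksum scheme (often called algorithm-based fault tolerance), not the grid-like code proposed in the paper. It locates and corrects only a single corrupted entry of the product, whereas the paper reports correcting up to two errors across three matrices. All function names (`matmul`, `encode_rows`, `encode_cols`, `locate_and_correct`) and the tolerance `tol` are illustrative assumptions.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode_rows(a):
    """Append a checksum row (the column sums) to A."""
    return a + [[sum(col) for col in zip(*a)]]


def encode_cols(b):
    """Append a checksum column (the row sums) to B."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(c_full, tol=1e-9):
    """Verify an (m+1) x (n+1) checksum product and repair it in place.

    Returns None if all checks pass, or the (row, col) of the single
    corrupted data entry after fixing it. Any other pattern, including
    an error in a checksum entry, raises ValueError.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    # A data row is suspect if its entries do not sum to its checksum.
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    # A data column is suspect if its entries do not sum to its checksum.
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # Rebuild the entry from the row checksum and the other entries.
        c_full[i][j] = c_full[i][n] - sum(c_full[i][k] for k in range(n) if k != j)
        return (i, j)
    raise ValueError("error pattern exceeds single-error capability")


# Small demonstration with exactly representable values.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_full = matmul(encode_rows(A), encode_cols(B))
C_full[0][1] += 100.0                # inject one silent error
print(locate_and_correct(C_full))    # (0, 1)
```

The absolute tolerance is adequate only for this toy input; with real floating-point data the threshold would have to scale with the magnitude of the entries so that rounding error is not mistaken for corruption.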
Problem

Research questions and friction points this paper is trying to address.

Detect and correct errors in matrix multiplication operations
Enhance fault tolerance in deep learning training
Improve error localization in distributed computing environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grid-based structural encoding for matrices
Detects and corrects multiple computational errors
Deterministic correction with low overhead
Hao Shi
Department of Mathematical Sciences, Tsinghua University, Beijing, China
Zhengyi Jiang
Department of Mathematical Sciences, Tsinghua University, Beijing, China, and Theory Lab, Central Research Institute, 2012 Labs, Huawei Tech. Co. Ltd., Hong Kong SAR
Zhongyi Huang
Professor of mathematics, Tsinghua University
Scientific computing, multiscale methods, singular perturbation problems, high frequency waves
Bo Bai
Theory Lab, Central Research Institute, 2012 Labs, Huawei Tech. Co. Ltd., Hong Kong SAR
Gong Zhang
Theory Lab, Central Research Institute, 2012 Labs, Huawei Tech. Co. Ltd., Hong Kong SAR
Hanxu Hou
Institute of Network Coding, The Chinese University of Hong Kong
Network coding, coding for distributed storage systems, channel coding