Resilience Against Soft Faults through Adaptivity in Spectral Deferred Correction

📅 2024-11-30
🏛️ arXiv.org
🤖 AI Summary
Growing hardware complexity in high-performance computing increases the risk that soft errors, in particular transient faults, corrupt the results of numerical simulations. This paper shows that adaptive spectral deferred correction (SDC), an iterative time-stepping scheme that can produce methods of arbitrary order, can detect and correct such faults through its adaptive step size selection. It builds on earlier results showing that adaptive Runge-Kutta methods provide this kind of resilience without adding computational cost. The performance of adaptive SDC is found to be comparable to that of the dedicated resilience strategy Hot Rod.

📝 Abstract
As supercomputers grow in hardware complexity, their susceptibility to faults increases and measures need to be taken to ensure the correctness of results. Some numerical algorithms have certain characteristics that allow them to recover from some types of faults. It has been demonstrated that adaptive Runge-Kutta methods provide resilience against transient faults without adding computational cost. Using recent advances in adaptive step size selection for spectral deferred correction (SDC), an iterative numerical time stepping scheme that can produce methods of arbitrary order, we show that adaptive SDC can also detect and correct transient faults. Its performance is found to be comparable to that of the dedicated resilience strategy Hot Rod.
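The resilience mechanism of adaptive methods mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' SDC implementation: it assumes a simple embedded Heun-Euler Runge-Kutta pair, a textbook step size controller, and a transient fault modeled as an additive perturbation of one stage (rather than a real bit flip). All function names and parameters (`heun_euler_step`, `integrate`, `fault_step`, `fault_size`) are hypothetical. The idea is that a fault inflates the embedded error estimate, so the step is rejected and recomputed just like any other step that misses the tolerance.

```python
import math

def heun_euler_step(f, t, u, dt, fault=0.0):
    """One embedded Heun-Euler step; returns (2nd-order solution, error estimate)."""
    k1 = f(t, u)
    k2 = f(t + dt, u + dt * k1) + fault   # assumed fault model: corrupt stage 2
    u_low = u + dt * k1                    # first-order (Euler) solution
    u_high = u + 0.5 * dt * (k1 + k2)      # second-order (Heun) solution
    return u_high, abs(u_high - u_low)

def integrate(f, u0, t_end, dt0, tol, fault_step=None, fault_size=0.0):
    """Adaptive integration; the fault hits only attempt number `fault_step`."""
    t, u, dt = 0.0, u0, dt0
    attempts, rejected = 0, 0
    while t < t_end - 1e-12:
        dt = min(dt, t_end - t)
        # Transient fault: present in one attempt, gone when the step is redone.
        fault = fault_size if attempts == fault_step else 0.0
        attempts += 1
        u_new, err = heun_euler_step(f, t, u, dt, fault)
        if err <= tol:
            t, u = t + dt, u_new           # accept the step
        else:
            rejected += 1                  # reject; retry from the same t and u
        # Textbook controller for a first-order error estimate, with limits.
        dt *= min(2.0, max(0.1, 0.9 * math.sqrt(tol / max(err, 1e-16))))
    return u, rejected

def decay(t, u):
    return -u                              # test problem u' = -u, u(0) = 1

u_clean, rejected_clean = integrate(decay, 1.0, 1.0, 0.01, 1e-4)
u_faulty, rejected_faulty = integrate(decay, 1.0, 1.0, 0.01, 1e-4,
                                      fault_step=5, fault_size=1.0)
print(rejected_clean, rejected_faulty, abs(u_faulty - math.exp(-1.0)))
```

In this setup the corrupted attempt is rejected by the ordinary tolerance check and redone with a smaller step, so no separate fault detector is involved. A perturbation small enough to keep the error estimate below `tol` would be accepted, which is consistent with the tolerance being the accuracy the user asked for.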
Problem

Research questions and friction points this paper is trying to address.

Enhancing resilience against soft faults in supercomputers
Determining whether adaptive SDC can detect and correct transient faults
Comparing adaptive SDC performance with Hot Rod strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive SDC detects and corrects transient faults
Uses adaptive step size selection for SDC
Performance comparable to Hot Rod strategy
Thomas Baumann
Jülich Supercomputing Centre, Wilhelm-Johnen-Straße, 52428 Jülich, Germany
Sebastian Götschel
Hamburg University of Technology, Am Schwarzenberg-Campus 3, 21073 Hamburg, Germany
Thibaut Lunet
Hamburg University of Technology, Am Schwarzenberg-Campus 3, 21073 Hamburg, Germany
Daniel Ruprecht
Hamburg University of Technology
computational mathematics, parallel-in-time integration, high-performance computing, scientific
R. Speck
Jülich Supercomputing Centre, Wilhelm-Johnen-Straße, 52428 Jülich, Germany