DeLIAP and DeLIAJ: Dependability Library Interfaces for Python and Julia

📅 2025-02-02
🤖 AI Summary
To address the lack of lightweight, low-overhead software-level fault tolerance mechanisms in Python and Julia, this paper introduces DeLIAP (for Python) and DeLIAJ (for Julia), the first efficient bindings of the C/C++ fault-tolerance library DeLIA to these scientific computing languages. Methodologically, the approach integrates native foreign-function interface calls (ctypes in Python, ccall in Julia), runtime error detection and recovery, and restricted local-scope checkpointing. Key contributions include: (1) filling a critical gap in system-level fault-tolerance library support for Python and Julia; (2) achieving a median performance overhead of only 1.4% in the Julia binding, as empirically measured; (3) successful integration into a real-world 4D full-waveform inversion application, demonstrating engineering viability; and (4) revealing critical constraints imposed by parallel execution models on checkpointing efficacy.
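The checkpoint-and-recover idea mentioned in the summary can be illustrated with a minimal pure-Python sketch. This is not DeLIA's or DeLIAP's API: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle-based file format, and the `fail_at` fault-injection parameter are all assumptions made for illustration. The sketch saves the loop state after every iteration and resumes from the last saved state when restarted.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` via a temp file and rename, so a crash
    mid-write never leaves a truncated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_iters, fail_at=None):
    """Sum 0..n_iters-1, checkpointing after every iteration.

    `fail_at` injects a simulated fault at that iteration (hypothetical,
    for illustration only).
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    for i in range(state["i"], n_iters):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"simulated fault at iteration {i}")
        state["total"] += i
        state["i"] = i + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run(path, 10, fail_at=6)` raises partway through; a second call `run(path, 10)` with the same `path` picks up at iteration 6 instead of starting over. A real library would also need to decide how often to checkpoint, since saving on every iteration trades runtime for recovery granularity.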

📝 Abstract
The ever-growing computational complexity of High Performance Computing applications is often met with horizontal scaling of computing systems. As a side effect, each added node risks becoming a single point of failure for parallel programs, increasing the demand for fault-tolerance techniques, especially at the software level. Under such conditions, the fault tolerance library DeLIA was developed in C/C++ with error detection and recovery features. We propose to extend the library's capabilities to Python and Julia through the wrappers DeLIAP and DeLIAJ, in order to lower the barrier to entry for implementing fault-tolerant solutions in these languages, both of which lack alternatives to the library. To validate the efficiency of the wrappers, we analyzed an application of the Julia wrapper to the 4D full-waveform inversion method, quantitatively assessing the introduced overhead through runtime comparisons, while an implementation report is provided to address applicability. The added computational cost resulted in a median overhead of 1.4%, while limitations in the original parallel computing module used in the application rendered local-scope data checkpointing infeasible.
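One plausible way to turn runtime comparisons into a median overhead figure is sketched below. The abstract does not state the exact procedure, so the pairing of runs, the helper name `relative_overheads`, and the sample timings are assumptions; the numbers are made up and are not the paper's measurements.

```python
from statistics import median


def relative_overheads(baseline, instrumented):
    """Per-run overhead, in percent, of instrumented runs over baseline runs."""
    if len(baseline) != len(instrumented):
        raise ValueError("expected paired measurements")
    return [
        100.0 * (t_ft - t_base) / t_base
        for t_base, t_ft in zip(baseline, instrumented)
    ]


# Made-up runtimes in seconds, for illustration only (not the paper's data).
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
instrumented = [101.0, 104.0, 99.0, 103.0, 100.0]

overheads = relative_overheads(baseline, instrumented)
print(f"median overhead: {median(overheads):.2f}%")
```

Reporting the median rather than the mean keeps a single slow outlier run, common on shared clusters, from dominating the overhead estimate.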
Problem

Research questions and friction points this paper is trying to address.

High-performance computing
Error detection and correction
Parallel computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeLIAJ
Error Detection
High-Performance Computing
Marcos Irigoyen
Dep. de Engenharia de Computação e Automação, Universidade Federal do Rio Grande do Norte, Natal, Brasil
Carla Santana
Lab. de Arquiteturas Paralelas para Processamento de Sinais, Universidade Federal do Rio Grande do Norte, Natal, Brasil
Ramon C. F. Araújo
Dep. de Física Teórica e Experimental, Universidade Federal do Rio Grande do Norte, Natal, Brasil
Samuel Xavier-de-Souza
Computer Engineering Professor, Universidade Federal do Rio Grande do Norte
parallel computing, energy-efficient software, scalable algorithms