WRATH: Workload Resilience Across Task Hierarchies in Task-based Parallel Programming Frameworks

📅 2025-03-17
🤖 AI Summary
Problem: Failure handling in Task-based Parallel Programming (TBPP) frameworks is inefficient because existing approaches apply uniform retry or checkpointing regardless of root cause, ignoring heterogeneous resources and the layered task structure. Method: This paper proposes WRATH, which classifies failures at the task level according to the layers of a TBPP framework and maps each category to a differentiated recovery action, such as hierarchically retrying a failed task on a suitable resource. It combines a distributed monitoring system, which captures execution and resource information and profiles tasks, with a resilient module that categorizes failures and responds in real time. Contribution/Results: Experiments show a threefold increase in task success rate, an application success rate above 90% for resolvable failures, and a 20%-50% reduction in time to failure, so tasks that are destined to fail are identified sooner.

📝 Abstract
Failures in Task-based Parallel Programming (TBPP) can severely degrade performance and result in incomplete or incorrect outcomes. Existing failure-handling approaches, including reactive, proactive, and resilient methods such as retry and checkpointing mechanisms, often apply uniform retry mechanisms regardless of the root cause of failures, failing to account for the unique characteristics of TBPP frameworks such as heterogeneous resource availability and task-level failures. To address these limitations, we propose WRATH, a novel systematic approach that categorizes failures based on the unique layered structure of TBPP frameworks and defines specific responses to address failures at different layers. WRATH combines a distributed monitoring system and a resilient module to collaboratively address different types of failures in real time. The monitoring system captures execution and resource information, reports failures, and profiles tasks across different layers of TBPP frameworks. The resilient module then categorizes failures and responds with appropriate actions, such as hierarchically retrying failed tasks on suitable resources. Evaluations demonstrate that WRATH significantly improves TBPP robustness, tripling the task success rate and maintaining an application success rate of over 90% for resolvable failures. Additionally, WRATH can reduce the time to failure by 20%-50%, allowing tasks that are destined to fail to be identified and fail more quickly.
Problem

Research questions and friction points this paper is trying to address.

Addresses failure-handling in Task-based Parallel Programming frameworks
Proposes WRATH for categorizing and responding to layered failures
Improves robustness and reduces failure time in TBPP frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorizes failures by TBPP framework layers
Combines monitoring and resilient modules
Hierarchically retries tasks on suitable resources
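
The categorize-then-retry idea listed above can be illustrated with a minimal Python sketch. It is not WRATH's implementation: the layer names, the exception-to-layer mapping, the `Resource` type, and the function names are all hypothetical, chosen only to show how a failure category might select the scope of a retry and how an unresolvable failure might be made to fail fast.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Layer(Enum):
    """Hypothetical failure layers (assumption, not the paper's taxonomy)."""
    TASK = 0    # transient failure: retrying on the same resource may work
    WORKER = 1  # the worker is suspect: move to any other resource
    NODE = 2    # the resource is too small: move to a larger one


class Unrecoverable(Exception):
    """Hypothetical marker for failures that no retry can resolve."""


@dataclass
class Resource:
    name: str
    memory_gb: int


def categorize(exc: Exception) -> Optional[Layer]:
    """Map an exception to an assumed layer; None means 'do not retry'."""
    if isinstance(exc, Unrecoverable):
        return None
    if isinstance(exc, MemoryError):
        return Layer.NODE
    if isinstance(exc, ConnectionError):
        return Layer.WORKER
    return Layer.TASK


def next_resource(resources: List[Resource], idx: int, layer: Layer) -> Optional[int]:
    """Pick the index of the next resource that suits the failure layer."""
    current = resources[idx]
    for j in range(idx + 1, len(resources)):
        if layer is Layer.WORKER or resources[j].memory_gb > current.memory_gb:
            return j
    return None


def run_with_retry(task: Callable[[Resource], object],
                   resources: List[Resource],
                   max_attempts: int = 3) -> object:
    """Run task, choosing the retry target from the failure category."""
    idx = 0
    for _ in range(max_attempts):
        try:
            return task(resources[idx])
        except Exception as exc:
            layer = categorize(exc)
            if layer is None:
                raise  # fail fast instead of spending the retry budget
            if layer is not Layer.TASK:
                nxt = next_resource(resources, idx, layer)
                if nxt is None:
                    raise  # no suitable resource is left
                idx = nxt
    raise RuntimeError("retry budget exhausted")


def needs_memory(resource: Resource) -> str:
    """Example task that only succeeds on a resource with at least 16 GB."""
    if resource.memory_gb < 16:
        raise MemoryError("not enough memory")
    return resource.name


pool = [Resource("small", 8), Resource("big", 32)]
result = run_with_retry(needs_memory, pool)
```

In this sketch the out-of-memory failure on the small resource is mapped to the assumed `NODE` layer, so the retry is placed on the larger resource rather than repeated in place; a failure marked `Unrecoverable` is re-raised on the first attempt.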
Sicheng Zhou
Department of Computer Science, Southern University of Science and Technology, Guangdong, China
Zhuozhao Li
Southern University of Science and Technology
Distributed Systems, High-performance Computing, Cloud Computing
Valérie Hayot-Sasson
Department of Computer Science, University of Chicago, Chicago, IL, USA
Haochen Pan
University of Chicago
Distributed Systems, Cloud Computing
Maxime Gonthier
Department of Computer Science, University of Chicago, Chicago, IL, USA
J. Gregory Pauloski
NVIDIA (formerly at ANL and UChicago)
Computer Science, HPC, Distributed Computing, Machine Learning, Systems
Ryan Chard
Argonne National Laboratory
Distributed systems, cloud computing
Kyle Chard
University of Chicago and Argonne National Laboratory
computer science, distributed systems, high performance computing, scientific computing
Ian Foster
Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, USA; Department of Computer Science, University of Chicago, Chicago, IL, USA