ClusterRCA: Network Failure Diagnosis in HPC Systems Using Multimodal Data

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Network failure diagnosis in high-performance computing (HPC) systems is hindered by the difficulty of fusing multi-source heterogeneous data and by low accuracy in root-cause localization. Method: This paper proposes a collaborative diagnosis framework that integrates multimodal feature learning with graph-based reasoning. It first extracts topological relationships between NIC pairs together with multidimensional runtime state features; a state classifier then generates initial fault labels, which drive a graph neural network (GNN) that models fault propagation; finally, a customized random-walk strategy on the propagation graph localizes the root cause and classifies the fault type. Contribution/Results: Evaluated on a real-world dataset from a leading HPC vendor, the method improves fault localization accuracy by 12.7% over baselines and remains robust under dynamic workloads and topology changes. It offers an interpretable, end-to-end automated diagnosis approach for large-scale HPC systems.

📝 Abstract
Network failure diagnosis is challenging yet critical for high-performance computing (HPC) systems. Existing methods cannot be directly applied to HPC scenarios because of data heterogeneity and insufficient accuracy. This paper proposes a novel framework, called ClusterRCA, that localizes culprit nodes and determines failure types by leveraging multimodal data. ClusterRCA extracts features from topologically connected network interface controller (NIC) pairs to analyze the diverse, multimodal data in HPC systems. To accurately localize culprit nodes and determine failure types, ClusterRCA combines classifier-based and graph-based approaches: it constructs a failure graph from the output of the state classifier and then performs a customized random walk on that graph to localize the root cause. Experiments on datasets collected by a top-tier global HPC device vendor show that ClusterRCA achieves high accuracy in diagnosing network failures in HPC systems. ClusterRCA also maintains robust performance across different application scenarios.
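To make the "random walk on a failure graph" step concrete, here is a minimal sketch in Python. It is not ClusterRCA's customized walk, whose details are not given on this page. It swaps in a standard random walk with restart, computed by power iteration, and assumes that edges point from a symptomatic node toward a possible cause and that each node carries an anomaly score from some state classifier. The function name, node names, and scores are all hypothetical.

```python
from collections import defaultdict


def rank_root_causes(edges, anomaly_score, restart=0.15, iters=100):
    """Rank nodes with a random walk with restart (assumed stand-in technique).

    edges: list of (src, dst); the walker moves from a symptom (src)
           toward a candidate cause (dst). This direction is an assumption.
    anomaly_score: dict node -> non-negative score from a state classifier.
    """
    nodes = sorted(set(anomaly_score) | {n for edge in edges for n in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    total = sum(anomaly_score.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one node needs a positive anomaly score")
    # The walker restarts at nodes in proportion to their anomaly score.
    restart_dist = {n: anomaly_score.get(n, 0.0) / total for n in nodes}

    prob = dict(restart_dist)
    for _ in range(iters):
        nxt = {n: restart * restart_dist[n] for n in nodes}
        for n in nodes:
            targets = out[n] or [n]  # a node with no out-edges keeps its mass
            # Prefer neighbours that look more anomalous.
            weights = [anomaly_score.get(t, 0.0) + 1e-9 for t in targets]
            wsum = sum(weights)
            for t, w in zip(targets, weights):
                nxt[t] += (1 - restart) * prob[n] * w / wsum
        prob = nxt
    return sorted(prob.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example: two abnormal NICs share one upstream switch.
edges = [("nic_a", "switch_1"), ("nic_b", "switch_1"), ("nic_c", "switch_2")]
scores = {"nic_a": 0.9, "nic_b": 0.8, "nic_c": 0.1,
          "switch_1": 0.7, "switch_2": 0.1}
ranking = rank_root_causes(edges, scores)
```

In this toy graph the walk concentrates probability on the switch that both abnormal NICs point to, which is the intuition behind using a walk for root-cause ranking: nodes that many symptoms lead to accumulate the most visits.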
Problem

Research questions and friction points this paper is trying to address.

Diagnosing network failures in HPC systems using multimodal data
Localizing culprit nodes and determining failure types accurately
Combining classifier-based and graph-based approaches for robust performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages multimodal data for failure diagnosis
Combines classifier and graph-based approaches
Uses customized random walk for root cause localization
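As an illustration of what features built from topologically connected NIC pairs could look like, the sketch below joins per-NIC metrics and log counts across each link. This is an assumption-laden example rather than the paper's feature set: the metric names, the use of log counts as a second modality, and the function `nic_pair_features` are all hypothetical.

```python
def nic_pair_features(links, metrics, log_counts, metric_names):
    """Build one feature vector per connected NIC pair (illustrative only).

    links: list of (nic_a, nic_b) tuples taken from the topology.
    metrics: dict nic -> dict metric_name -> float (numeric modality).
    log_counts: dict nic -> number of error log lines (textual modality,
                reduced to a count here for simplicity).
    """
    features = {}
    for a, b in links:
        row = []
        for name in metric_names:
            va, vb = metrics[a][name], metrics[b][name]
            # Both endpoints plus their disagreement: a healthy link
            # usually shows similar counters on its two sides.
            row.extend([va, vb, abs(va - vb)])
        row.extend([log_counts.get(a, 0), log_counts.get(b, 0)])
        features[(a, b)] = row
    return features


# Hypothetical inputs for a single link.
links = [("nic0", "nic1")]
metrics = {
    "nic0": {"rx_errors": 3.0, "tx_drop": 0.0},
    "nic1": {"rx_errors": 0.0, "tx_drop": 5.0},
}
log_counts = {"nic1": 2}
pair_features = nic_pair_features(links, metrics, log_counts,
                                  ["rx_errors", "tx_drop"])
```

Vectors like these could then be passed to any state classifier, whose per-pair outputs would supply the labels from which a failure graph is built.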
Yongqian Sun
Nankai University
AIOps, Anomaly Detection, Failure Localization, Microservices Fault Diagnosis, Root Cause Analysis
Xijie Pan
Nankai University
Xiao Xiong
Nankai University
Failure Diagnosis
Lei Tao
Nankai University
AIOps, LLM4Ops, HPC
Jiaju Wang
Nankai University
Shenglin Zhang
Nankai University
AI Operations in general
Yuan Yuan
National University of Defense Technology
Yuqi Li
National University of Defense Technology
Kunlin Jian
Huawei