An Incremental Multi-Level, Multi-Scale Approach to Assessment of Multifidelity HPC Systems

📅 2024-11-17
🏛️ SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
🤖 AI Summary
To address the difficulty of analyzing terabyte-scale, multi-source monitoring logs from large heterogeneous HPC systems, this paper proposes I-mrDMD, an incremental version of multiresolution dynamic mode decomposition (mrDMD). I-mrDMD decomposes high-dimensional environment-log time series into spatiotemporal patterns at several frequency ranges and updates the result as new data arrive. The extracted patterns are visually aligned with processed hardware and job logs through a generalizable rack-level visualization built with D3 and integrated into the Jupyter Notebook interface. The approach is demonstrated on two use scenarios with real-world data from Theta, a Cray XC40 supercomputer.

📝 Abstract
With the growing complexity in architecture and the size of large-scale computing systems, monitoring and analyzing system behavior and events has become daunting. Monitoring data amounting to terabytes per day are collected by sensors housed in these massive systems at multiple fidelity levels and varying temporal resolutions. In this work, we develop an incremental version of multiresolution dynamic mode decomposition (mrDMD), which converts high-dimensional data to spatial-temporal patterns at varied frequency ranges. Our incremental implementation of the mrDMD algorithm (I-mrDMD) promptly reveals valuable information in the massive environment log dataset, which is then visually aligned with the processed hardware and job log datasets through our generalizable rack visualization using D3 visualization integrated into the Jupyter Notebook interface. We demonstrate the efficacy of our approach with two use scenarios on a real-world dataset from a Cray XC40 supercomputer, Theta.
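As a rough illustration of the decomposition the abstract builds on, the sketch below implements plain exact DMD with NumPy. It is not the authors' I-mrDMD: there is no incremental update and no multiresolution recursion, and the function name `exact_dmd`, the synthetic 8-sensor signal, and all parameter values are assumptions made for this example only.

```python
import numpy as np

def exact_dmd(X, dt, rank):
    """Plain exact DMD of a snapshot matrix X (n_sensors x n_snapshots).

    Returns DMD eigenvalues, spatial modes, and mode frequencies in Hz.
    """
    X1, X2 = X[:, :-1], X[:, 1:]            # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    # Truncate the SVD to the requested rank.
    U, s, V = U[:, :rank], s[:rank], Vh[:rank].conj().T
    X2_V_Sinv = (X2 @ V) / s                # X2 V Sigma^{-1} (scales each column)
    A_tilde = U.conj().T @ X2_V_Sinv        # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2_V_Sinv @ W                   # exact DMD modes (spatial patterns)
    freqs = np.abs(np.angle(eigvals)) / (2 * np.pi * dt)
    return eigvals, modes, freqs

# Hypothetical synthetic data: 8 "sensors" sharing one 0.5 Hz oscillation.
rng = np.random.default_rng(0)
dt = 0.1
t = np.arange(200) * dt
a, b = rng.standard_normal(8), rng.standard_normal(8)
X = np.outer(a, np.cos(2 * np.pi * 0.5 * t)) + np.outer(b, np.sin(2 * np.pi * 0.5 * t))

eigvals, modes, freqs = exact_dmd(X, dt, rank=2)
print(freqs)  # both entries should be close to 0.5
```

In the standard mrDMD formulation, a step like this is applied recursively: the slowest modes in a time window are extracted and subtracted, the window is split in half, and the procedure repeats on each half, which is what yields patterns at varied frequency ranges.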
Problem

Research questions and friction points this paper is trying to address.

Complex Data Analysis
Large-scale Sensor Data
Log File Interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

I-mrDMD
Big Data Analysis
Visualization
Shilpika Shilpika
Argonne Leadership Computing Facility, Argonne National Laboratory
Bethany Lusch
Argonne National Lab
machine learning, optimization, scientific computing, data science
V. Vishwanath
Argonne Leadership Computing Facility, Argonne National Laboratory
M. Papka
Argonne Leadership Computing Facility, Argonne National Laboratory & Department of Computer Science, University of Illinois Chicago