🤖 AI Summary
Operating ultra-large-scale heterogeneous HPC systems means analyzing terabyte-scale, multi-source monitoring logs, where anomaly response is slow and root-cause localization is imprecise. To address these challenges, this paper proposes the Incremental Multi-resolution Dynamic Mode Decomposition (I-mrDMD) framework. I-mrDMD decomposes streaming high-dimensional time-series data online, across multiple fidelity levels and time scales, to help relate hardware states, job behaviors, and environmental events in space and time. The patterns extracted from the environment logs are visually aligned with the hardware and job logs through a generalizable, rack-level interactive visualization built with D3.js and integrated into Jupyter Notebook. Two use scenarios on production data from Theta, a Cray XC40 supercomputer, demonstrate the approach for online diagnosis.
📝 Abstract
As large-scale computing systems grow in size and architectural complexity, monitoring and analyzing their behavior and events has become daunting. Sensors housed in these massive systems collect terabytes of monitoring data per day at multiple fidelity levels and varying temporal resolutions. In this work, we develop an incremental version of multiresolution dynamic mode decomposition (mrDMD), which converts high-dimensional data to spatial-temporal patterns at varied frequency ranges. Our incremental implementation of the mrDMD algorithm (I-mrDMD) promptly reveals valuable information in the massive environment log dataset, which is then visually aligned with the processed hardware and job log datasets through our generalizable rack visualization, built with D3 and integrated into the Jupyter Notebook interface. We demonstrate the efficacy of our approach with two use scenarios on a real-world dataset from Theta, a Cray XC40 supercomputer.
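As a rough illustration of the building block behind mrDMD, the following is a minimal sketch of standard (exact) dynamic mode decomposition in NumPy. It is not the paper's I-mrDMD implementation: the function name `dmd`, the synthetic eight-sensor signal, and every parameter value are assumptions chosen for illustration. The sketch shows how a sensors-by-time matrix is turned into spatial modes, each paired with a temporal frequency.

```python
import numpy as np

def dmd(X, rank, dt=1.0):
    """Exact DMD of a snapshot matrix X with shape (sensors, time steps).

    Hypothetical helper for illustration; returns spatial modes,
    eigenvalues, and the oscillation frequency of each mode.
    """
    X1, X2 = X[:, :-1], X[:, 1:]          # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]   # rank truncation
    V = Vh.conj().T
    # Reduced operator U^H X2 V S^-1 (dividing by s scales columns)
    A_tilde = U.conj().T @ X2 @ V / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (X2 @ V / s) @ W              # spatial patterns
    # Continuous-time frequency in cycles per unit time
    freqs = np.abs(np.log(eigvals.astype(complex)).imag) / (2 * np.pi * dt)
    return modes, eigvals, freqs

# Synthetic stand-in for sensor logs: a slow (0.2 Hz) and a fast (1.0 Hz)
# oscillation, each with its own spatial pattern across 8 sensors.
dt = 0.1
t = np.arange(200) * dt
rng = np.random.default_rng(0)
a1, b1, a2, b2 = rng.standard_normal((4, 8))
X = (np.outer(a1, np.cos(2 * np.pi * 0.2 * t))
     + np.outer(b1, np.sin(2 * np.pi * 0.2 * t))
     + np.outer(a2, np.cos(2 * np.pi * 1.0 * t))
     + np.outer(b2, np.sin(2 * np.pi * 1.0 * t)))

modes, eigvals, freqs = dmd(X, rank=4, dt=dt)
print(np.round(np.sort(freqs), 3))
```

Standard mrDMD applies this step recursively: it keeps the slowest modes in a time window, subtracts their contribution, splits the window in half, and repeats on each half so that faster dynamics are resolved at finer time scales. The incremental update that distinguishes I-mrDMD is not shown here.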