Understanding Large-Scale HPC System Behavior Through Cluster-Based Visual Analytics

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of reliably detecting and interpreting anomalous node behavior in large-scale high-performance computing (HPC) systems, where monitoring data is high-dimensional and unlabeled. The authors propose a scalable, interactive visual analytics system whose analysis workflow combines two-phase dimensionality reduction with contrastive learning and multi-resolution dynamic mode decomposition to capture inter- and intra-cluster variations. Tailored visual encodings and customizable baselines let users explore clusters, compare temporal patterns, and iteratively refine hypotheses about node behavior. In two case studies, the system automatically identified meaningful node clusters and revealed subtle behavioral differences within and across node groups. Expert feedback confirmed that the tool helps with detecting and interpreting anomalous behavior.
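
To make the idea of a contrastive projection concrete, here is a minimal sketch. The summary does not say which contrastive method the authors use, so contrastive PCA (cPCA) is swapped in as one simple, well-known instance. The function name, the `alpha` parameter, and the synthetic "node metric" arrays are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, n_components=2):
    """Directions with high variance in `target` but low variance in `background`.

    Assumption: this is plain contrastive PCA, used here only as an
    illustrative stand-in for the paper's contrastive learning step.
    Both inputs are (n_samples, n_features) arrays.
    """
    t = target - target.mean(axis=0)
    b = background - background.mean(axis=0)
    cov_t = t.T @ t / (len(t) - 1)
    cov_b = b.T @ b / (len(b) - 1)
    # Eigenvectors of the contrast matrix; eigh returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov_t - alpha * cov_b)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

# Hypothetical data: feature 0 varies strongly in both groups,
# feature 1 varies strongly only in the target cluster.
rng = np.random.default_rng(0)
background = rng.normal(scale=[3.0, 0.1, 0.1], size=(2000, 3))
target = rng.normal(scale=[3.0, 3.0, 0.1], size=(2000, 3))

components = contrastive_pca(target, background, n_components=1)
print(components[:, 0])  # dominated by feature 1, the cluster-specific one
```

Ordinary PCA on the target group alone would weight features 0 and 1 about equally here. Subtracting the background covariance removes the variation shared by both groups, so the leading direction isolates what is specific to the target cluster.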

📝 Abstract

In high-performance computing (HPC) environments, system monitoring data is often unlabeled and high-dimensional, making it difficult to reliably detect and understand anomalous computing nodes. The growing scale and dimensionality of the collected datasets present significant challenges for analysis and visualization tasks. We present a scalable, interactive visual analytics system to support exploration, explanation, and comparison of compute node behaviors in HPC systems. Our approach integrates an analysis workflow combining two-phase dimensionality reduction with contrastive learning and multi-resolution dynamic mode decomposition to capture inter- and intra-cluster variations. These analyses are embedded in an interactive interface that enables users to explore clusters, compare temporal patterns, and iteratively refine hypotheses through customizable visual encodings and baselines. By integrating metrics such as CPU utilization and memory activity, the system offers a holistic view of large-scale system behavior. We demonstrate the utility of our tool through two case studies. In both cases, our system automatically identified meaningful node clusters and revealed subtle behavioral differences within and across node groups. Expert feedback confirmed the effectiveness of our tool in enhancing anomalous behavior detection and interpretation. Our work advances scalable visual analysis for HPC monitoring and has broader implications for cloud, edge computing, and distributed infrastructures where interpretability and behavior analysis are critical to operational efficiency.
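
As a rough illustration of the dynamic mode decomposition building block mentioned in the abstract, the following is a minimal single-level (exact) DMD sketch in NumPy. Multi-resolution DMD applies this kind of decomposition recursively over successively shorter time windows, which is not shown here. The function name, the `rank` parameter, and the toy linear system are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def dmd(X, rank):
    """Exact DMD of a snapshot matrix X with shape (n_features, n_timesteps).

    Returns the DMD eigenvalues and modes of the best-fit linear map
    that advances each snapshot to the next one.
    """
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    V = Vh.conj().T
    # Reduced operator: U^H X2 V S^{-1} (division by s scales columns).
    A_tilde = U.conj().T @ X2 @ V / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (X2 @ V / s) @ W
    return eigvals, modes

# Hypothetical toy system x_{t+1} = A x_t standing in for node metrics.
A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
x = np.array([1.0, 1.0])
snapshots = [x]
for _ in range(19):
    x = A @ x
    snapshots.append(x)
X = np.column_stack(snapshots)

eigvals, modes = dmd(X, rank=2)
print(np.sort(eigvals.real))  # should recover the eigenvalues of A
```

Each eigenvalue describes how one temporal pattern grows, decays, or oscillates per time step, which is what makes DMD useful for comparing temporal behavior across nodes.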

Problem

Research questions and friction points this paper is trying to address.

high-performance computing

anomaly detection

visual analytics

high-dimensional data

system monitoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual analytics

contrastive learning

dynamic mode decomposition