🤖 AI Summary
HPC systems generate massive volumes of heterogeneous telemetry data, yet existing operational data analytics (ODA) approaches rely on schema-less storage, which impedes semantic integration and cross-platform interoperability. To address this, we propose UHPC-Onto, the first unified ontology for ODA in HPC systems, which enables semantic standardization and knowledge graph (KG) integration across heterogeneous data centers and covers the M100 and F-DATA datasets. Our approach combines ontology modeling optimizations with lightweight KG construction to reduce storage overhead: up to 38.84% compared to a previous approach, with an additional 26.82% reduction depending on the deployment configuration. UHPC-Onto is validated through 36 competency questions and provides a scalable, semantically rich foundation for efficient, interpretable, cross-datacenter telemetry analysis, particularly under complex workloads such as generative AI.
📝 Abstract
Modern high-performance computing (HPC) systems generate massive volumes of heterogeneous telemetry data from millions of sensors monitoring compute, memory, power, cooling, and storage subsystems. As HPC infrastructures scale to support increasingly complex workloads, including generative AI, the need for efficient, reliable, and interoperable telemetry analysis becomes critical. Operational Data Analytics (ODA) has emerged to address these demands; however, its reliance on schema-less storage solutions limits data accessibility and semantic integration. Ontologies and knowledge graphs (KGs) capture domain semantics and thereby enable efficient and expressive data querying, but they face two challenges: significant storage overhead, and the limited applicability of existing ontologies, which are often tailored to a single HPC system. In this paper, we present the first unified ontology for ODA in HPC systems, designed to enable semantic interoperability across heterogeneous data centers. Our ontology models telemetry data from the two largest publicly available ODA datasets, M100 (Cineca, Italy) and F-DATA (Fugaku, Japan), within a single data model. The ontology is validated through 36 competency questions reflecting real-world stakeholder requirements, and we introduce modeling optimizations that reduce KG storage overhead by up to 38.84% compared to a previous approach, with an additional 26.82% reduction depending on the desired deployment configuration. This work paves the way for scalable ODA KGs and supports not only analysis within individual systems but also cross-system analysis across heterogeneous HPC systems.