Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs

📅 2025-05-20

🤖 AI Summary

High-performance computing (HPC) systems face energy-efficiency bottlenecks, including inaccurate GPU power attribution, high idle power consumption, and poor alignment between load and power. Method: Using real telemetry and log data from the Polaris supercomputer, this work proposes a co-analysis framework for heterogeneous, multi-source data that combines time-series cleaning, log alignment, statistical modeling, and job-level power attribution. Contribution/Results: The analysis reduces data volume by 94% while preserving the essential energy patterns. It identifies three actionable optimization pathways: (1) job-level power strategies, (2) GPU idle power mitigation, and (3) aligning GPU power allocation with workload demands. The result is a practical, reproducible diagnostic approach for applying existing energy-efficiency research to production systems.

📝 Abstract

As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data co-analysis approach using system data collected from the Polaris supercomputer at Argonne National Laboratory. We focus on GPU utilization and power demands, navigating the complexities of large-scale, heterogeneous datasets. Our approach, which incorporates data preprocessing, post-processing, and statistical methods, condenses the data volume by 94% while preserving essential insights. Through this analysis, we uncover key opportunities for power optimization, such as reducing high idle power costs, applying power strategies at the job level, and aligning GPU power allocation with workload demands. Our findings provide actionable insights for energy-efficient computing and offer a practical, reproducible approach for applying existing research to optimize system performance.
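
To make the idea of co-analyzing telemetry with job logs concrete, here is a minimal Python sketch of job-level energy attribution. It is not the authors' pipeline: the sample schema `(timestamp_s, node, power_watts)`, the `Job` record, and the `attribute_energy` helper are all hypothetical, and the sketch assumes one power reading per node per fixed sampling period. Each reading is assigned to the job running on that node at that time, and readings with no matching job are counted as idle energy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical scheduler-log record for one job on one node."""
    job_id: str
    node: str
    start: float  # seconds
    end: float    # seconds, exclusive


def attribute_energy(samples, jobs, period_s):
    """Split GPU energy into per-job and idle totals, in joules.

    samples: iterable of (timestamp_s, node, power_watts), assumed to be
             one reading per node per sampling period.
    jobs:    iterable of Job records from the scheduler log.
    """
    jobs_by_node = {}
    for job in jobs:
        jobs_by_node.setdefault(job.node, []).append(job)

    energy_by_job = {}
    idle_energy = 0.0
    for ts, node, watts in samples:
        # Rectangle rule: treat power as constant over one sampling period.
        joules = watts * period_s
        owner = next(
            (j for j in jobs_by_node.get(node, []) if j.start <= ts < j.end),
            None,
        )
        if owner is None:
            idle_energy += joules
        else:
            energy_by_job[owner.job_id] = (
                energy_by_job.get(owner.job_id, 0.0) + joules
            )
    return energy_by_job, idle_energy


# Toy data: one node, four 10-second samples, one job covering the middle two.
samples = [
    (0.0, "n0", 60.0),
    (10.0, "n0", 300.0),
    (20.0, "n0", 320.0),
    (30.0, "n0", 60.0),
]
jobs = [Job("job-a", "n0", 10.0, 30.0)]
energy_by_job, idle_energy = attribute_energy(samples, jobs, period_s=10.0)
print(energy_by_job, idle_energy)
```

The linear scan over a node's jobs keeps the sketch short. For telemetry at supercomputer scale, an interval index or a sorted merge of samples and job intervals would likely be needed instead.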

Problem

Research questions and friction points this paper is trying to address.

Understanding GPU power consumption in HPC workloads

Analyzing large-scale heterogeneous supercomputer telemetry data

Identifying power optimization opportunities for energy-efficient computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data co-analysis of supercomputer telemetry and logs

Preprocessing and post-processing to reduce data volume

Job-level power strategies for GPU optimization
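
The summary does not say how the preprocessing reduces data volume, so the following Python sketch shows only one common option, assumed here for illustration: aggregating raw power samples into fixed time windows that keep the mean, the peak, and the sample count. The `condense` function and its `window_s` parameter are hypothetical names, and the toy reduction figure it produces is unrelated to the 94% reported in the paper.

```python
from statistics import mean


def condense(samples, window_s):
    """Aggregate (timestamp_s, power_watts) samples into fixed windows.

    Returns rows of (window_start_s, mean_watts, max_watts, count),
    sorted by window start. Keeping the max alongside the mean preserves
    short power peaks that averaging alone would hide.
    """
    buckets = {}
    for ts, watts in samples:
        window_start = (ts // window_s) * window_s
        buckets.setdefault(window_start, []).append(watts)
    return [
        (start, mean(vals), max(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


# Toy trace: 60 one-second samples that step from 100 W to 250 W at t = 30 s.
raw = [(float(t), 100.0 if t < 30 else 250.0) for t in range(60)]
summary = condense(raw, window_s=20.0)
reduction = 1 - len(summary) / len(raw)
print(summary, reduction)
```

Choosing the window length is a trade-off: longer windows shrink the data further but blur transitions such as the start and end of a job, which matter for job-level attribution.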
