Extracting Practical, Actionable Energy Insights from Supercomputer Telemetry and Logs

📅 2025-05-20
🤖 AI Summary
High-performance computing (HPC) systems face energy-efficiency challenges that include hard-to-attribute GPU power consumption, high idle power draw, and GPU power allocation that is poorly matched to workload demand. Method: Using telemetry and log data collected from the Polaris supercomputer at Argonne National Laboratory, this work presents a data co-analysis approach that combines preprocessing, post-processing, and statistical methods over large, heterogeneous datasets. Contribution/Results: The approach condenses the data volume by 94% while preserving essential insights. The analysis identifies three opportunities for power optimization: (1) applying power strategies at the job level, (2) reducing high idle power costs, and (3) aligning GPU power allocation with workload demands. The authors present the result as a practical, reproducible approach for applying existing energy-efficiency research to system operation.

📝 Abstract
As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding GPU power consumption within modern HPC workloads. This work addresses this challenge by presenting a data co-analysis approach using system data collected from the Polaris supercomputer at Argonne National Laboratory. We focus on GPU utilization and power demands, navigating the complexities of large-scale, heterogeneous datasets. Our approach, which incorporates data preprocessing, post-processing, and statistical methods, condenses the data volume by 94% while preserving essential insights. Through this analysis, we uncover key opportunities for power optimization, such as reducing high idle power costs, applying power strategies at the job level, and aligning GPU power allocation with workload demands. Our findings provide actionable insights for energy-efficient computing and offer a practical, reproducible approach for applying existing research to optimize system performance.
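
To make the idea of co-analyzing telemetry with job logs concrete, the following is a minimal sketch of job-level energy attribution. It is not the authors' pipeline: the `Sample` and `Job` record layouts, the `attribute_energy` helper, and the fixed sampling interval are all assumptions made for illustration. Each power sample is treated as constant over one sampling interval and is credited either to the job running on that node at that time or to an idle bucket.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One GPU power reading (hypothetical telemetry record)."""
    t: float        # seconds since an arbitrary epoch
    node: str
    power_w: float  # GPU power draw in watts


@dataclass(frozen=True)
class Job:
    """One scheduler log entry (hypothetical job-log record)."""
    job_id: str
    node: str
    start: float    # inclusive, seconds
    end: float      # exclusive, seconds


def attribute_energy(samples, jobs, interval_s):
    """Return (joules per job_id, idle joules).

    Assumes a fixed sampling interval and at most one job per node at a time.
    Each sample contributes power_w * interval_s joules (rectangle rule).
    """
    jobs_by_node = defaultdict(list)
    for job in jobs:
        jobs_by_node[job.node].append(job)

    job_energy = defaultdict(float)
    idle_energy = 0.0
    for s in samples:
        # Find the job, if any, whose time window covers this sample.
        owner = next(
            (j for j in jobs_by_node[s.node] if j.start <= s.t < j.end),
            None,
        )
        joules = s.power_w * interval_s
        if owner is None:
            idle_energy += joules
        else:
            job_energy[owner.job_id] += joules
    return dict(job_energy), idle_energy


if __name__ == "__main__":
    samples = [
        Sample(0, "n0", 50.0),
        Sample(10, "n0", 300.0),
        Sample(20, "n0", 310.0),
        Sample(30, "n0", 55.0),
    ]
    jobs = [Job("job-a", "n0", 10, 30)]
    per_job, idle = attribute_energy(samples, jobs, interval_s=10.0)
    print(per_job, idle)
```

Separating the idle bucket from per-job totals is what would let an analysis of this kind quantify idle power cost alongside job-level demand. A production version would need to handle irregular sampling, shared nodes, and clock skew between telemetry and scheduler logs.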
Problem

Research questions and friction points this paper is trying to address.

Understanding GPU power consumption in HPC workloads
Analyzing large-scale heterogeneous supercomputer telemetry data
Identifying power optimization opportunities for energy-efficient computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data co-analysis of supercomputer telemetry and logs
Preprocessing and post-processing to reduce data volume
Job-level power strategies for GPU optimization
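
As an illustration of how preprocessing can shrink telemetry volume while keeping the statistics that matter, here is a minimal sketch that aggregates raw power samples into fixed time windows. This is an assumed technique chosen for illustration, not the method the paper uses to reach its 94% figure. The `downsample` and `reduction` helpers and the `(time, power)` tuple layout are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def downsample(samples, window_s):
    """Aggregate (t_seconds, power_w) pairs into fixed windows.

    Returns a list of (window_start, mean_w, max_w, n_samples), sorted by
    window start. Keeping mean and max preserves average draw and peaks.
    """
    buckets = defaultdict(list)
    for t, power_w in samples:
        window_start = int(t // window_s) * window_s
        buckets[window_start].append(power_w)
    return [
        (start, fmean(values), max(values), len(values))
        for start, values in sorted(buckets.items())
    ]


def reduction(n_in, n_out):
    """Fraction of rows removed, e.g. 0.94 means 94% fewer rows."""
    return 1.0 - n_out / n_in


if __name__ == "__main__":
    # Two minutes of 1 Hz samples alternating between 100 W and 101 W.
    raw = [(t, 100.0 + (t % 2)) for t in range(120)]
    summary = downsample(raw, window_s=60)
    print(summary)
    print(f"rows removed: {reduction(len(raw), len(summary)):.1%}")
```

The window length trades volume against fidelity: longer windows remove more rows but smooth out short power spikes, which is why the sketch keeps the per-window maximum alongside the mean.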
Melanie Cornelius
University of Illinois at Chicago, Chicago, IL, USA
Greg Cross
University of Illinois at Chicago, Chicago, IL, USA
Shilpika Shilpika
Argonne National Laboratory, Lemont, IL, USA
Matthew T. Dearing
University of Illinois at Chicago, Chicago, IL, USA
Zhiling Lan
Professor of Computer Science, University of Illinois Chicago
cluster scheduling, energy efficiency, AI4Sys, modeling and simulation, resilience