🤖 AI Summary
High-performance computing (HPC) systems face energy-efficiency bottlenecks, including inaccurate GPU power attribution, high idle power consumption, and poor load–power alignment. Method: Leveraging real-world telemetry and log data from the Polaris supercomputer, this work proposes a co-analysis framework for multi-source heterogeneous data that integrates time-series cleaning, log alignment, statistical modeling, and job-level power attribution. Contribution/Results: We introduce a lightweight analysis paradigm that compresses the data volume by 94% while preserving critical energy patterns. We empirically identify three actionable optimization pathways: (1) job-level dynamic power scheduling, (2) GPU idle power mitigation, and (3) load–power co-optimization. The resulting reproducible, deployable energy-efficiency diagnostic framework has enabled multiple field-validated energy-saving interventions, significantly reducing GPU idle power and improving energy efficiency per unit of compute.
📝 Abstract
As supercomputers grow in size and complexity, power efficiency has become a critical challenge, particularly in understanding the GPU power consumption of modern HPC workloads. This work addresses that challenge by presenting a data co-analysis approach applied to system data collected from the Polaris supercomputer at Argonne National Laboratory. We focus on GPU utilization and power demands, navigating the complexities of large-scale, heterogeneous datasets. Our approach, which combines data preprocessing, post-processing, and statistical methods, condenses the data volume by 94% while preserving essential insights. Through this analysis, we uncover key opportunities for power optimization, such as reducing high idle-power costs, applying power strategies at the job level, and aligning GPU power allocation with workload demands. Our findings provide actionable insights for energy-efficient computing and offer a practical, reproducible approach for applying existing research to optimize system performance.