🤖 AI Summary
This study addresses the lack of tools for co-optimizing performance, energy consumption, and total cost of ownership (TCO) in GPU-accelerated high-performance computing (HPC) clusters, a gap that complicates procurement and deployment decisions. To bridge this gap, the authors present Wattlytics, an interactive web-based platform that integrates a benchmark-driven GPU performance scaling model, DVFS-aware piecewise power modeling, and multi-year TCO analysis. The platform lets users configure heterogeneous systems, workloads, and constraints to explore a multidimensional design space. Sensitivity analysis (elasticity, Sobol indices, and Monte Carlo simulation) and collaborative features support decision-making under uncertainty. Case studies show that, under budget or energy constraints, choosing energy-efficient GPUs rather than prioritizing peak performance alone can yield better overall cost-effectiveness.
📝 Abstract
The escalating computational demands and energy footprint of GPU-accelerated computing systems complicate informed design and operational decisions. We present the first release of Wattlytics (https://wattlytics.netlify.app), an interactive, browser-based decision-support system. Unlike existing procurement-oriented calculators, Wattlytics uniquely integrates benchmark-driven GPU performance scaling, dynamic voltage and frequency scaling (DVFS)-aware piecewise power modeling, and multi-year total cost of ownership (TCO) analysis within a single interactive environment. Users can configure heterogeneous systems across contemporary GPU architectures (GH200, H100, L40S, L40, A40, A100, and L4), select representative scientific workloads (e.g., GROMACS, AMBER), and explore deployment scenarios under constraints such as energy prices, system lifetime, and frequency scaling. Wattlytics computes multidimensional decision metrics (TCO breakdown, work-per-TCO, power-per-TCO, and work-per-watt-per-TCO) and supports design-space exploration, what-if scenarios, sensitivity metrics (elasticity, Sobol indices, Monte Carlo), and collaborative features to guide realistic cluster design and procurement under uncertainty. We demonstrate selected scenarios comparing deployment strategies under different operational modes: fixed budget, fixed GPU count, fixed performance, and fixed power. Our case studies show that, under budget or energy constraints, optimally deployed energy-efficient GPUs can outperform higher-performance alternatives in overall cost-effectiveness. Wattlytics helps users explore the design parameter space and distinguish between cost- and risk-driving factors, turning HPC design into a well-informed and explainable decision-making process.
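To make the work-per-TCO metric and the fixed-budget mode concrete, here is a minimal sketch. It is not Wattlytics's actual model: it assumes a simple linear TCO (purchase price plus energy cost at a constant average power draw) and ignores DVFS, cooling overheads beyond a single PUE factor, and all other cost items. The class `GpuOption`, the function names, and every numeric value are hypothetical placeholders chosen for illustration, not measurements or vendor data.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760


@dataclass
class GpuOption:
    name: str               # hypothetical label, not a real product
    unit_price_eur: float   # purchase price per GPU (placeholder value)
    avg_power_w: float      # assumed constant average draw under load
    work_per_year: float    # benchmark work units one GPU completes per year


def tco_eur(gpu, count, years, eur_per_kwh, pue=1.0):
    """Simplified TCO: capital cost plus energy cost over the lifetime."""
    capex = gpu.unit_price_eur * count
    energy_kwh = gpu.avg_power_w / 1000.0 * HOURS_PER_YEAR * years * count * pue
    return capex + energy_kwh * eur_per_kwh


def max_count_for_budget(gpu, budget_eur, years, eur_per_kwh, pue=1.0):
    """Fixed-budget mode: largest GPU count whose TCO fits the budget."""
    per_gpu = tco_eur(gpu, 1, years, eur_per_kwh, pue)
    return int(budget_eur // per_gpu)


def work_per_tco(gpu, count, years, eur_per_kwh, pue=1.0):
    """Total work delivered over the lifetime, per euro of TCO."""
    total_work = gpu.work_per_year * count * years
    return total_work / tco_eur(gpu, count, years, eur_per_kwh, pue)


# Illustrative placeholder options: one fast and costly, one slower and frugal.
fast = GpuOption("fast-gpu", 30000.0, 700.0, 100.0)
efficient = GpuOption("efficient-gpu", 8000.0, 300.0, 40.0)

YEARS, EUR_PER_KWH, BUDGET = 5, 0.30, 500_000.0

for gpu in (fast, efficient):
    n = max_count_for_budget(gpu, BUDGET, YEARS, EUR_PER_KWH)
    total_work = gpu.work_per_year * n * YEARS
    ratio = work_per_tco(gpu, n, YEARS, EUR_PER_KWH)
    print(f"{gpu.name}: {n} GPUs, total work {total_work:.0f}, work/TCO {ratio:.5f}")
```

With these placeholder inputs the slower, cheaper option fits more units into the same budget and delivers more total work. That is the kind of outcome the case studies describe, but the result depends entirely on the assumed prices, power draws, and throughputs.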