🤖 AI Summary
Quantifying the coupled effects of scheduling policies, incentive mechanisms, and physical infrastructure (power/cooling) in HPC systems prior to deployment remains challenging. Method: This paper introduces the first digital twin framework for HPC that natively integrates scheduling functionality—embedding scheduling logic within the twin itself, interfacing with external schedulers via standardized APIs, and incorporating machine learning–based schedulers trained on multi-source, publicly available HPC operational data to enable cross-layer co-simulation. Contribution/Results: It enables the first repeatable, verifiable virtual assessment of how scheduling policies and incentive structures impact energy efficiency and cooling load. The framework has been applied to model multiple real-world HPC systems. Empirical evaluation demonstrates that the proposed ML-driven scheduler significantly outperforms baseline strategies in jointly optimizing resource utilization, energy consumption, and cooling demand—providing a scalable methodological foundation for green, sustainable supercomputing operations.
📝 Abstract
Schedulers are critical for optimal resource utilization in high-performance computing (HPC). Traditional methods for evaluating schedulers are limited to post-deployment analysis or to simulators that do not model the associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies of how parameter configurations and scheduling decisions affect the physical assets, even before deployment, or for changes that are not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems based on their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures and (5) evaluate machine-learning-based scheduling in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.