HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling

📅 2025-08-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Quantifying the coupled effects of scheduling policies, incentive mechanisms, and physical infrastructure (power and cooling) in HPC systems prior to deployment remains challenging.

Method: This paper introduces the first digital twin framework for HPC that natively integrates scheduling functionality. It embeds scheduling logic within the twin itself, interfaces with external schedulers via standardized APIs, and incorporates machine-learning-based schedulers trained on multi-source, publicly available HPC operational data to enable cross-layer co-simulation.

Contribution/Results: It enables the first repeatable, verifiable virtual assessment of how scheduling policies and incentive structures impact energy efficiency and cooling load. The framework has been applied to model multiple real-world HPC systems. Empirical evaluation demonstrates that the proposed ML-driven scheduler significantly outperforms baseline strategies in jointly optimizing resource utilization, energy consumption, and cooling demand, providing a scalable methodological foundation for green, sustainable supercomputing operations.

📝 Abstract

Schedulers are critical for optimal resource utilization in high-performance computing (HPC). Traditional methods for evaluating schedulers are limited to post-deployment analysis or to simulators that do not model the associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment, or regarding changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems using their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures and (5) evaluate machine-learning-based scheduling in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.
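To make the idea of a what-if study concrete, the following is a minimal sketch, not the paper's framework: a toy event-driven simulator that replays the same jobs under two scheduling policies and reports makespan, IT energy, and peak IT and cooling power. All names (`Job`, `simulate`, `fcfs`, `sjf`) and the constants `IDLE_WATTS` and `COOLING_OVERHEAD` are hypothetical, and cooling is assumed to be a fixed fraction of IT power, which a real digital twin would replace with a thermo-fluid model.

```python
import heapq
from dataclasses import dataclass

IDLE_WATTS = 200.0        # assumed idle power draw per node (W)
COOLING_OVERHEAD = 0.3    # assumed: cooling power is 30% of IT power


@dataclass(frozen=True)
class Job:
    name: str
    submit: int            # submit time (s)
    runtime: int           # run time (s)
    nodes: int             # nodes requested
    watts_per_node: float  # power draw per busy node (W)


def fcfs(job):
    """First-come, first-served: order the queue by submit time."""
    return job.submit


def sjf(job):
    """Shortest job first: order the queue by run time."""
    return job.runtime


def simulate(jobs, total_nodes, policy):
    """Replay jobs under a policy (no backfilling).

    Returns (makespan_s, it_energy_joules, peak_it_watts, peak_cooling_watts).
    """
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the system has")
    pending = sorted(jobs, key=lambda j: j.submit)
    queue, running = [], []   # running is a heap of (end_time, seq, job)
    free, t, seq = total_nodes, 0, 0
    energy, peak = 0.0, 0.0
    while pending or queue or running:
        # Move newly arrived jobs into the wait queue.
        while pending and pending[0].submit <= t:
            queue.append(pending.pop(0))
        # Start jobs in policy order; stop at the first one that does not fit.
        queue.sort(key=policy)
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            free -= job.nodes
            heapq.heappush(running, (t + job.runtime, seq, job))
            seq += 1
        # IT power is constant until the next event (job end or arrival).
        power = free * IDLE_WATTS + sum(
            j.nodes * j.watts_per_node for _, _, j in running)
        peak = max(peak, power)
        next_events = [running[0][0]] if running else []
        if pending:
            next_events.append(pending[0].submit)
        nxt = min(next_events)
        energy += power * (nxt - t)
        t = nxt
        # Release the nodes of every job that has finished by now.
        while running and running[0][0] <= t:
            _, _, done = heapq.heappop(running)
            free += done.nodes
    return t, energy, peak, peak * COOLING_OVERHEAD


if __name__ == "__main__":
    workload = [
        Job("A", 0, 100, 4, 500.0),
        Job("B", 0, 10, 2, 400.0),
        Job("C", 0, 10, 2, 400.0),
    ]
    for name, policy in (("FCFS", fcfs), ("SJF", sjf)):
        print(name, simulate(workload, total_nodes=6, policy=policy))
```

On this toy workload, the two policies trade off against each other: one finishes sooner and uses less energy, while the other has a lower peak power and therefore a lower peak cooling load under the assumed overhead. This is the kind of coupling between scheduling decisions and physical infrastructure that the what-if studies are meant to expose.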

Problem

Research questions and friction points this paper is trying to address.

Evaluating HPC scheduling policies and incentive structures pre-deployment

Assessing impact of scheduling decisions on power and cooling infrastructure

Enabling what-if studies for HPC system sustainability through digital twins

Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital twin framework with scheduling capabilities

Integration of publicly available datasets from top-tier HPC systems

Extensions for integrating external scheduling simulators
