🤖 AI Summary
Problem: High electricity costs severely hinder the sustainability of multi-site high-performance computing (HPC) systems.
Method: This paper proposes TARDIS, presented as the first scheduler to combine graph neural network (GNN)-based prediction of per-job power consumption with a spatiotemporal cooperative scheduling framework spanning multiple HPC centers. TARDIS models dynamic job power profiles with GNNs, then jointly optimizes job placement in time (exploiting time-of-use electricity pricing) and in space (exploiting geographic price differentials), using time-varying electricity price modeling and multi-objective integer programming.
Contribution/Results: Unlike conventional single-site or single-dimensional schedulers, TARDIS optimizes over both time and location. In trace-driven simulations it reduces electricity costs by up to 18% with temporal optimization at a single center and by 10–20% in multi-center scenarios, while maintaining stable throughput and application performance. The authors position the approach as a scalable route to cost-efficient, sustainable HPC operations.
📝 Abstract
This paper introduces TARDIS (Temporal Allocation for Resource Distribution using Intelligent Scheduling), a novel power-aware job scheduler for High-Performance Computing (HPC) systems that minimizes electricity costs through both temporal and spatial optimization. Our approach addresses growing concern over energy consumption in HPC centers, where electricity expenses constitute a substantial portion of operational costs. TARDIS employs a Graph Neural Network (GNN) to predict the power consumption of individual jobs, then uses these predictions to schedule jobs across multiple HPC facilities according to time-varying electricity prices. The system integrates temporal scheduling, which shifts power-intensive workloads to off-peak hours, with spatial scheduling, which distributes jobs across geographically dispersed centers that have different pricing schemes. We evaluate TARDIS using trace-based simulations of real HPC workloads, demonstrating cost reductions of up to 18% in temporal optimization scenarios and 10 to 20% in multi-site environments compared to state-of-the-art scheduling approaches, while maintaining comparable system performance and job throughput. Our evaluation shows that TARDIS addresses limitations of existing power-aware scheduling approaches by combining accurate power prediction with holistic spatial-temporal optimization, providing a scalable solution for sustainable and cost-efficient HPC operations.
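The idea of combining temporal and spatial scheduling can be illustrated with a small Python sketch. It is not the TARDIS algorithm: it replaces the GNN predictor with a hand-written hourly power profile and replaces the paper's optimization with an exhaustive search over one job. The function names (`job_cost`, `cheapest_placement`), the site names, and all prices and power values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def job_cost(power_kw: List[float], prices: List[float], start: int) -> float:
    """Electricity cost of a job that starts at hour `start`.

    power_kw[i] is the (assumed, pre-computed) average draw in kW during the
    i-th hour of the job; prices[h] is the price in $/kWh during hour h.
    """
    return sum(p * prices[start + i] for i, p in enumerate(power_kw))


def cheapest_placement(
    power_kw: List[float],
    site_prices: Dict[str, List[float]],
    latest_start: int,
) -> Tuple[str, int, float]:
    """Try every (site, start hour) pair and return the cheapest one.

    `latest_start` bounds how long the job may be delayed, standing in for a
    deadline or queue-wait constraint.
    """
    best = None
    for site, prices in site_prices.items():
        # The job must finish within the price horizon known for this site.
        last = min(latest_start, len(prices) - len(power_kw))
        for start in range(last + 1):
            cost = job_cost(power_kw, prices, start)
            if best is None or cost < best[2]:
                best = (site, start, cost)
    if best is None:
        raise ValueError("no feasible placement")
    return best


# Hypothetical inputs: a two-hour job and four hours of prices at two sites.
predicted_power_kw = [100.0, 50.0]
site_prices = {
    "site_a": [0.30, 0.30, 0.10, 0.10],  # time-of-use tariff, cheap later on
    "site_b": [0.20, 0.20, 0.20, 0.20],  # flat tariff
}

flexible = cheapest_placement(predicted_power_kw, site_prices, latest_start=2)
urgent = cheapest_placement(predicted_power_kw, site_prices, latest_start=0)
print(flexible, urgent)
```

With these made-up numbers, a job that may wait two hours is placed at `site_a` in its off-peak window (a temporal shift), whereas a job that must start immediately goes to `site_b`, where the current price is lower (a spatial shift). A scheduler handling many jobs at once would also need capacity constraints and performance objectives, which this sketch omits.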