🤖 AI Summary
The rapid growth of AI supercomputing is driving large, fast-changing electricity demand, which creates challenges for utilities and system operators. To address this, we extend ExaDigiT, an open-source digital twin framework for modeling the power, cooling, and scheduling of supercomputers, to handle heterogeneity, multi-tenancy, and cloud-scale workloads. This work focuses on replaying and rescheduling job traces from the MIT SuperCloud TX-GAIA system so that sustainability policies can be explored with reinforcement learning (RL). ExaDigiT's RAPS module serves as the simulation environment, reporting detailed power and performance statistics for studying scheduling strategies, incentive structures, and hardware/software prototypes. Preliminary RL experiments with Proximal Policy Optimization (PPO) show that learning energy-aware scheduling decisions is feasible, which suggests ExaDigiT can serve as a platform for exploring policies that improve throughput, efficiency, and sustainability.
📝 Abstract
The rapid growth of AI supercomputing is creating unprecedented power demands, with next-generation GPU datacenters requiring hundreds of megawatts and producing fast, large swings in consumption. To address the resulting challenges for utilities and system operators, we extend ExaDigiT, an open-source digital twin framework for modeling power, cooling, and scheduling of supercomputers. Originally developed for replaying traces from leadership-class HPC systems, ExaDigiT now incorporates heterogeneity, multi-tenancy, and cloud-scale workloads. In this work, we focus on trace replay and rescheduling of jobs on the MIT SuperCloud TX-GAIA system to enable reinforcement learning (RL)-based experimentation with sustainability policies. The RAPS module provides a simulation environment with detailed power and performance statistics, supporting the study of scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary RL experiments using Proximal Policy Optimization demonstrate the feasibility of learning energy-aware scheduling decisions, highlighting ExaDigiT's potential as a platform for exploring optimal policies to improve throughput, efficiency, and sustainability.
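To make "trace replay and rescheduling" concrete, the following is a minimal sketch, not the RAPS implementation or its API. It replays a small job trace on a fixed pool of nodes under a pluggable ordering policy and integrates a simple power model over time. Everything here is an assumption for illustration: the `Job` fields, the `replay` function, the constant per-node power model, and the absence of backfilling and cooling.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    submit: float   # submission time in seconds
    nodes: int      # number of nodes requested
    runtime: float  # wall-clock runtime in seconds
    watts: float    # assumed average power per busy node, in watts


def replay(jobs, total_nodes, idle_watts, policy):
    """Replay `jobs`; return (makespan_s, energy_joules, mean_wait_s).

    `policy` is a sort key over queued jobs (hypothetical interface).
    """
    if any(j.nodes > total_nodes for j in jobs):
        raise ValueError("a job requests more nodes than the cluster has")
    pending = sorted(jobs, key=lambda j: j.submit)
    queue, running = [], []  # running is a heap of (end_time, seq, job)
    free, now, seq = total_nodes, 0.0, 0
    busy_power = energy = total_wait = 0.0
    while pending or queue or running:
        # Move jobs that have arrived by `now` into the wait queue.
        while pending and pending[0].submit <= now:
            queue.append(pending.pop(0))
        # Start jobs in policy order while the head of the queue fits
        # (no backfilling, to keep the sketch short).
        queue.sort(key=policy)
        while queue and queue[0].nodes <= free:
            job = queue.pop(0)
            free -= job.nodes
            busy_power += job.nodes * job.watts
            total_wait += now - job.submit
            heapq.heappush(running, (now + job.runtime, seq, job))
            seq += 1
        # Jump to the next event: a job completion or a new arrival.
        events = [running[0][0]] if running else []
        if pending:
            events.append(pending[0].submit)
        nxt = min(events)
        # Integrate power (busy nodes plus idle nodes) over the interval.
        energy += (busy_power + free * idle_watts) * (nxt - now)
        now = nxt
        # Release the nodes of every job that has finished by `now`.
        while running and running[0][0] <= now:
            _, _, done = heapq.heappop(running)
            free += done.nodes
            busy_power -= done.nodes * done.watts
    mean_wait = total_wait / len(jobs) if jobs else 0.0
    return now, energy, mean_wait


# Toy trace: the same two jobs replayed under two ordering policies.
trace = [Job(0, 2, 100, 300), Job(0, 1, 10, 300)]
for name, key in [("fcfs", lambda j: j.submit), ("sjf", lambda j: j.runtime)]:
    print(name, replay(trace, total_nodes=2, idle_watts=100, policy=key))
```

In an RL setting, a learned policy would take the place of the fixed sort key, and quantities such as energy and wait time would feed the reward. How RAPS actually exposes state, actions, and rewards to PPO is not described in the abstract.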