Trace Replay Simulation of MIT SuperCloud for Studying Optimal Sustainability Policies

📅 2025-09-19
🤖 AI Summary
The rapid growth of AI supercomputing is driving large, highly volatile electricity demand, which challenges grid stability and sustainable operation. To address this, we extend the open-source digital twin framework ExaDigiT with support for heterogeneous, multi-tenant, cloud-scale workloads, and use it to replay and reschedule job traces from the MIT SuperCloud TX-GAIA system. The RAPS module provides the simulation environment, reporting detailed power and performance statistics for studying scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary reinforcement learning experiments with Proximal Policy Optimization (PPO) show that learning energy-aware scheduling decisions is feasible. The work positions ExaDigiT as a platform for exploring policies that improve throughput, efficiency, and sustainability.

📝 Abstract
The rapid growth of AI supercomputing is creating unprecedented power demands, with next-generation GPU datacenters requiring hundreds of megawatts and producing fast, large swings in consumption. To address the resulting challenges for utilities and system operators, we extend ExaDigiT, an open-source digital twin framework for modeling power, cooling, and scheduling of supercomputers. Originally developed for replaying traces from leadership-class HPC systems, ExaDigiT now incorporates heterogeneity, multi-tenancy, and cloud-scale workloads. In this work, we focus on trace replay and rescheduling of jobs on the MIT SuperCloud TX-GAIA system to enable reinforcement learning (RL)-based experimentation with sustainability policies. The RAPS module provides a simulation environment with detailed power and performance statistics, supporting the study of scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary RL experiments using Proximal Policy Optimization demonstrate the feasibility of learning energy-aware scheduling decisions, highlighting ExaDigiT's potential as a platform for exploring optimal policies to improve throughput, efficiency, and sustainability.
Problem

Research questions and friction points this paper is trying to address.

Modeling power and cooling for next-generation GPU datacenters with large consumption swings
Studying optimal sustainability policies through reinforcement learning on supercomputer workloads
Developing a simulation environment for energy-aware scheduling decisions that improve efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends the ExaDigiT digital twin framework to heterogeneous, multi-tenant, cloud-scale workloads
Enables reinforcement learning for sustainability policy experimentation
Provides trace replay simulation with power statistics
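
To make the idea of trace replay with power statistics concrete, here is a minimal sketch in Python. It is not the RAPS or ExaDigiT API: the `Job` fields, the `replay_fcfs` function, the first-come-first-served policy, and the constant per-node power model are all assumptions chosen for illustration. It replays a list of recorded jobs on a fixed pool of nodes and reports start times, makespan, and total energy, which is the kind of statistic a scheduling policy (learned or hand-written) could be scored on.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    """One record of a hypothetical job trace (field names are assumptions)."""
    job_id: int
    submit: float        # submission time in seconds
    duration: float      # recorded run time in seconds
    nodes: int           # number of nodes requested
    node_power_w: float  # mean power per node while running, in watts


def replay_fcfs(jobs, total_nodes, idle_node_power_w):
    """Replay jobs in submit order (FCFS, no backfill) on a fixed node pool.

    Returns start times, makespan (measured from t=0), and total energy in
    joules, counting idle nodes at a constant idle power.
    """
    free = total_nodes
    running = []             # min-heap of (end_time, nodes)
    clock = 0.0
    makespan = 0.0
    starts = {}
    job_energy_j = 0.0
    busy_node_seconds = 0.0

    for job in sorted(jobs, key=lambda j: (j.submit, j.job_id)):
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} requests more nodes than exist")
        clock = max(clock, job.submit)
        # Release nodes held by jobs that have already finished.
        while running and running[0][0] <= clock:
            _, n = heapq.heappop(running)
            free += n
        # If there is still not enough room, advance time to the next completion.
        while free < job.nodes:
            end, n = heapq.heappop(running)
            clock = max(clock, end)
            free += n
        starts[job.job_id] = clock
        free -= job.nodes
        end_time = clock + job.duration
        heapq.heappush(running, (end_time, job.nodes))
        makespan = max(makespan, end_time)
        job_energy_j += job.nodes * job.node_power_w * job.duration
        busy_node_seconds += job.nodes * job.duration

    idle_node_seconds = total_nodes * makespan - busy_node_seconds
    total_energy_j = job_energy_j + idle_node_seconds * idle_node_power_w
    return {"starts": starts, "makespan": makespan, "energy_j": total_energy_j}


trace = [
    Job(job_id=1, submit=0.0, duration=100.0, nodes=2, node_power_w=300.0),
    Job(job_id=2, submit=10.0, duration=50.0, nodes=2, node_power_w=200.0),
]
stats = replay_fcfs(trace, total_nodes=3, idle_node_power_w=50.0)
print(stats)
```

In this toy trace the second job has to wait for the first to finish because only three nodes exist. Swapping the FCFS loop for a different ordering rule and comparing `energy_j` and `makespan` is a simplified version of the policy comparison the summary describes.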
Wesley Brewer
Oak Ridge National Laboratory, Oak Ridge, USA
Matthias Maiterth
Oak Ridge National Laboratory
High Performance Computing, Parallel Computing, Energy Efficiency, Scalable Performance Tools, Computer Architecture
Damien Fay
Hewlett Packard Enterprise, Dublin, Ireland