Trace Replay Simulation of MIT SuperCloud for Studying Optimal Sustainability Policies

📅 2025-09-19

🤖 AI Summary

The rapid growth of AI supercomputing is producing very large, highly volatile electricity demand, which challenges grid stability and sustainable operation. To address this, we extend the open-source digital twin framework ExaDigiT to support heterogeneous, multi-tenant, cloud-scale workloads, and use it to replay and reschedule job traces from the MIT SuperCloud TX-GAIA system. The RAPS module serves as the simulation environment, reporting detailed power and performance statistics for studying scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary reinforcement learning experiments using Proximal Policy Optimization (PPO) show that learning energy-aware scheduling decisions is feasible. The results are a feasibility demonstration, not a measured improvement: they position ExaDigiT as a platform for exploring policies that improve throughput, efficiency, and sustainability.

📝 Abstract

The rapid growth of AI supercomputing is creating unprecedented power demands, with next-generation GPU datacenters requiring hundreds of megawatts and producing fast, large swings in consumption. To address the resulting challenges for utilities and system operators, we extend ExaDigiT, an open-source digital twin framework for modeling power, cooling, and scheduling of supercomputers. Originally developed for replaying traces from leadership-class HPC systems, ExaDigiT now incorporates heterogeneity, multi-tenancy, and cloud-scale workloads. In this work, we focus on trace replay and rescheduling of jobs on the MIT SuperCloud TX-GAIA system to enable reinforcement learning (RL)-based experimentation with sustainability policies. The RAPS module provides a simulation environment with detailed power and performance statistics, supporting the study of scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary RL experiments using Proximal Policy Optimization demonstrate the feasibility of learning energy-aware scheduling decisions, highlighting ExaDigiT's potential as a platform for exploring optimal policies to improve throughput, efficiency, and sustainability.
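For reference, the abstract names Proximal Policy Optimization but does not say which variant or reward is used. The most commonly cited form is the clipped surrogate objective, reproduced here as background. Reading the state s_t as the scheduler's view of the queue and system, and the action a_t as a scheduling decision, is an assumption made for illustration and is not taken from the paper.

```latex
% Standard PPO clipped surrogate objective (background, not the paper's stated formulation).
% r_t(\theta): probability ratio between new and old policies
% \hat{A}_t:   advantage estimate at step t
% \epsilon:    clipping range hyperparameter
\[
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
  \qquad
  L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right]
\]
```

The clipping term limits how far a single update can move the policy away from the one that collected the data.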

Problem

Research questions and friction points this paper is trying to address.

Modeling power and cooling for next-generation GPU datacenters with large consumption swings

Studying optimal sustainability policies through reinforcement learning on supercomputer workloads

Developing a simulation environment for energy-aware scheduling decisions to improve efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends the ExaDigiT digital twin framework to heterogeneous, multi-tenant, cloud-scale supercomputer workloads

Enables reinforcement learning for sustainability policy experimentation

Provides trace replay simulation with detailed power and performance statistics
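To make the idea of trace replay with power statistics concrete, here is a minimal sketch in Python. It is not the RAPS or ExaDigiT API: the `Job` fields, the `replay` function, the strict first-come-first-served policy, and the constant per-node power model are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical trace record; real traces carry many more fields.
    job_id: int
    submit: int            # submission time in seconds
    duration: int          # runtime in seconds
    nodes: int             # number of nodes requested
    watts_per_node: float  # assumed constant draw per busy node


def replay(jobs, total_nodes, idle_watts, dt=1):
    """Replay a job trace under strict FCFS and record system power.

    Returns (power_series_in_watts, total_energy_in_joules).
    """
    for job in jobs:
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} can never fit on the system")

    queue = sorted(jobs, key=lambda j: j.submit)
    running = []  # list of (end_time, job)
    free = total_nodes
    t = 0
    power = []

    while True:
        # Retire jobs that have finished by time t and release their nodes.
        still_running = []
        for end, job in running:
            if end <= t:
                free += job.nodes
            else:
                still_running.append((end, job))
        running = still_running

        if not queue and not running:
            break

        # Start queued jobs in order; the head of the queue blocks those behind it.
        while queue and queue[0].submit <= t and queue[0].nodes <= free:
            job = queue.pop(0)
            free -= job.nodes
            running.append((t + job.duration, job))

        # Power = busy nodes at their job's draw + idle nodes at idle draw.
        busy_watts = sum(j.nodes * j.watts_per_node for _, j in running)
        power.append(busy_watts + free * idle_watts)
        t += dt

    return power, sum(power) * dt


trace = [
    Job(job_id=1, submit=0, duration=2, nodes=2, watts_per_node=500.0),
    Job(job_id=2, submit=1, duration=2, nodes=4, watts_per_node=500.0),
]
power, energy = replay(trace, total_nodes=4, idle_watts=100.0)
print(power, energy)
```

Swapping the FCFS loop for a different ordering rule is where a rescheduling or learned policy would plug in; the power series then gives a simple signal for comparing policies on the same trace.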

🔎 Similar Papers

Carbon Footprint Reduction for Sustainable Data Centers in Real-Time