Provisioning to Runtime Optimization of a +100 MW AI Cluster

📅 2026-05-23
🤖 AI Summary
This study addresses electric power supply as the critical bottleneck for AI data centers, one that increasingly constrains progress toward artificial general intelligence. The work describes an end-to-end power management process for hyperscale AI clusters, covering power planning 6–12 months before next-generation accelerators become generally available, tuning of power settings after large-scale deployment, and dynamic runtime power management for evolving workloads. It reports detailed power measurements from a real-world 150 MW datacenter hosting a cluster of 83K GB200 GPUs. According to the authors, it is the first published account of this full power management lifecycle, and it is offered as a reference for practitioners deploying next-generation AI accelerators at scale.
📝 Abstract
The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability. To our knowledge, this paper is the first to describe the end-to-end power management process for a hyper-scale AI datacenter: from early power planning to accommodate next-generation accelerators 6–12 months before their general availability, to tuning power settings after large-scale deployment, and finally to dynamic, runtime power management for evolving workloads. We present detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs. We share insights from building this state-of-the-art AI cluster. We hope this work encourages practitioners across the industry to share their own experiences as well.
Problem

Research questions and friction points this paper is trying to address.

power provisioning
AI datacenter
runtime optimization
power management
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

power provisioning
runtime power optimization
AI datacenter
dynamic power management
hyper-scale infrastructure
Ehsan K. Ardestani
Facebook
Computer Architecture
Leonardo Piga
Meta
Jovan Stojkovic
University of Illinois at Urbana-Champaign
computer architecture, cloud computing
Pavan Balaji
Argonne National Laboratory
Parallel and Distributed Computing
Mustafa Ozdal
Meta
high performance computing, parallel and heterogeneous computing, computer-aided design algorithms
Mikel Jimenez Fernandez
Meta Platforms
Mihaela Dimovska
Meta Platforms
Luka Tadic
Meta Platforms
Hao Shen
Meta Platforms
Devika Vishwanath
Meta Platforms
Richa Mishra
Meta Platforms
Melaku Mihret
Meta Platforms
Valentin Andrei
Meta Platforms
Mauricio Cespedes
Meta Platforms
Julien Prigent
Meta Platforms
James Monahan
Meta Platforms
Tyler Graf
Meta Platforms
Bin Li
Microsoft Research
video coding, video transmission, HEVC
Charles Marquez
Meta Platforms
Shobhit Kanaujia
Unknown affiliation
Kaushik Veeraraghavan
University of Michigan, Facebook Inc.
Distributed Systems, Operating Systems
Chunqiang Tang
Meta Platforms