🤖 AI Summary
This study addresses power supply in AI data centers, which the authors identify as the most significant bottleneck in the race toward artificial general intelligence, ahead of even accelerator availability. The work describes an end-to-end power management process for hyperscale AI clusters that covers three stages: power planning 6–12 months before next-generation accelerators reach general availability, tuning of power settings after large-scale deployment, and dynamic runtime power management for evolving workloads. The paper reports detailed power measurements from a real-world 150 MW datacenter hosting a cluster of 83K GB200 GPUs and shares insights from building it. To the authors' knowledge, it is the first published account of this full-lifecycle process, and it is intended to encourage other practitioners to share their own experiences.
📝 Abstract
The electric power supply for AI data centers is now the most significant bottleneck in the race toward Artificial General Intelligence, surpassing even the constraint of AI accelerator availability. To our knowledge, this paper is the first to describe the end-to-end power management process for a hyper-scale AI datacenter: from early power planning to accommodate next-generation accelerators 6–12 months before their general availability, to tuning power settings after large-scale deployment, and finally to dynamic, runtime power management for evolving workloads. We present detailed power measurements for a 150 MW datacenter hosting a cluster of 83K GB200 GPUs. We share insights from building this state-of-the-art AI cluster. We hope this work encourages practitioners across the industry to share their own experiences as well.