AI Summary
This work addresses the challenge of millisecond-scale power surges in data centers caused by GPU synchronization during AI training, which pose risks to grid stability. The authors propose EasyRider, a rack-level architecture that integrates passive circuit elements with an actively controlled auxiliary energy storage system. Co-designed with software-level optimizations to extend storage lifetime, EasyRider operates without modifications to existing training frameworks or compromises in energy efficiency. To the best of the authors' knowledge, this is the first solution capable of effectively mitigating rapid power transients from AI workloads. Evaluated on a 400VDC prototype, the system successfully confines rack-level power fluctuations within grid compliance limits across diverse load profiles, achieving a practical balance between grid safety and system energy efficiency.
Abstract
Large-scale AI model training workloads use thousands of GPUs operating in tightly synchronized loops. During synchronous communication, start-up, shut-down, and checkpointing, GPU power consumption can swing from peak to idle within milliseconds. These large and rapid load swings endanger grid infrastructure as they induce steep power ramp rates, voltage and frequency shifts, and reactive power transients that can damage transformers, converters, and protection equipment. To solve this problem, we introduce EasyRider, a power architecture to mitigate power fluctuations at the rack level. EasyRider uses passive components and actively-controlled auxiliary energy storage to attenuate rack power swings. A software system continually monitors the energy storage system to maximize its lifetime in the presence of frequent charge/discharge cycles. EasyRider filters rack power variations to be within grid safety requirements without requiring software modifications to AI training frameworks or wasting energy. We evaluate EasyRider on a 400VDC-rated prototype system against published workload traces and our own GPU testbed, demonstrating its effectiveness across heterogeneous power levels and workload power profiles.
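To make the idea of attenuating rack power swings with auxiliary storage concrete, the following is a minimal sketch of an idealized ramp-rate limiter. It is not EasyRider's controller or circuit, which the abstract does not specify. It assumes lossless storage with unlimited capacity, and the function names, the synthetic load trace, and the 100 kW/s ramp limit are all hypothetical values chosen for illustration rather than figures from the paper or from any grid code.

```python
def ramp_limit(load_w, dt_s, max_ramp_w_per_s):
    """Slew-rate-limit the grid draw; storage covers the remainder.

    Returns (grid_w, storage_w). storage_w > 0 means the storage is
    discharging into the rack, storage_w < 0 means it is charging.
    Assumption: ideal, lossless storage with no capacity limit.
    """
    max_step = max_ramp_w_per_s * dt_s  # largest allowed change per sample
    grid = [load_w[0]]
    for p in load_w[1:]:
        delta = p - grid[-1]
        delta = max(-max_step, min(max_step, delta))  # clamp the step
        grid.append(grid[-1] + delta)
    storage = [l - g for l, g in zip(load_w, grid)]
    return grid, storage


def max_ramp(trace_w, dt_s):
    """Largest absolute ramp rate (W/s) between consecutive samples."""
    return max(abs(b - a) for a, b in zip(trace_w, trace_w[1:])) / dt_s


def storage_energy_swing_j(storage_w, dt_s):
    """Peak-to-peak excursion of cumulative storage energy (J)."""
    energy = lo = hi = 0.0
    for p in storage_w:
        energy += p * dt_s
        lo = min(lo, energy)
        hi = max(hi, energy)
    return hi - lo


# Hypothetical synthetic trace: a rack alternating between 20 kW (idle)
# and 100 kW (peak) every 0.5 s, sampled at 1 ms.
DT_S = 0.001
LOAD_W = ([20_000.0] * 500 + [100_000.0] * 500) * 3
LIMIT_W_PER_S = 100_000.0  # assumed limit, for illustration only

grid_w, storage_w = ramp_limit(LOAD_W, DT_S, LIMIT_W_PER_S)
print("raw load max ramp (W/s):  ", max_ramp(LOAD_W, DT_S))
print("grid draw max ramp (W/s): ", max_ramp(grid_w, DT_S))
print("storage energy swing (J): ", storage_energy_swing_j(storage_w, DT_S))
```

The energy swing reported at the end hints at why storage lifetime matters: every load swing that the grid does not see becomes a charge or discharge event for the storage, which is the cycling that the abstract's monitoring software is meant to manage.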