🤖 AI Summary
Data centers must optimize both energy efficiency and performance for HPC and AI workloads under strict facility power constraints. To address this, the paper presents datacenter power profiles, a workload-aware power optimization feature for the NVIDIA Blackwell architecture that combines hardware and software support. The approach integrates domain-knowledge-driven policy generation, GPU-level power management, Max-Q technology, phase-based power delivery control, and low-level architectural support. Experimental evaluation shows up to 15% energy savings while critical applications retain at least 97% of baseline performance, which enables an overall throughput increase of up to 13% in a power-constrained facility.
📝 Abstract
This paper presents datacenter power profiles, a new NVIDIA software feature released with Blackwell B200 and aimed at improving energy efficiency, performance, or both. The initial feature provides coarse-grained user control for HPC and AI workloads, combining hardware and software innovations for intelligent power management with domain knowledge of those workloads. The resulting workload-aware optimization recipes maximize computational throughput while operating within strict facility power constraints. The phase-1 Blackwell implementation achieves up to 15% energy savings while maintaining performance levels above 97% for critical applications, enabling an overall throughput increase of up to 13% in a power-constrained facility.
KEYWORDS GPU power management, energy efficiency, power profile, HPC optimization, Max-Q, Blackwell architecture
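
The link the abstract draws between per-GPU power savings and facility-level throughput can be illustrated with a simplified model. The sketch below is not the paper's methodology: the function names, the continuous GPU count, the assumption that GPUs account for the entire power budget, and the example input values are all hypothetical. The reported "up to" figures may also come from different workloads, so they need not satisfy this model simultaneously.

```python
def constrained_throughput_gain(power_ratio: float, perf_ratio: float) -> float:
    """Relative facility throughput change under a fixed power budget.

    power_ratio: per-GPU power with a profile / baseline per-GPU power.
    perf_ratio:  per-GPU performance with a profile / baseline performance.

    Assumption (hypothetical model): the facility budget is spent only on
    GPUs, so lowering per-GPU power lets 1 / power_ratio times as many
    GPUs run within the same budget.
    """
    if not 0.0 < power_ratio <= 1.0:
        raise ValueError("power_ratio must be in (0, 1]")
    if perf_ratio <= 0.0:
        raise ValueError("perf_ratio must be positive")
    gpu_count_ratio = 1.0 / power_ratio
    return gpu_count_ratio * perf_ratio - 1.0


def energy_ratio(power_ratio: float, perf_ratio: float) -> float:
    """Energy for a fixed amount of work, relative to baseline.

    Energy = power * time, and run time scales as 1 / perf_ratio.
    """
    if power_ratio <= 0.0 or perf_ratio <= 0.0:
        raise ValueError("ratios must be positive")
    return power_ratio / perf_ratio


if __name__ == "__main__":
    # Illustrative inputs only: 14% lower per-GPU power at 97% performance.
    gain = constrained_throughput_gain(power_ratio=0.86, perf_ratio=0.97)
    saved = 1.0 - energy_ratio(power_ratio=0.86, perf_ratio=0.97)
    print(f"throughput gain: {gain:.1%}, energy saved per job: {saved:.1%}")
```

Under this model, a small performance loss is outweighed whenever the per-GPU power reduction is proportionally larger, which is the trade-off a power-constrained facility would exploit.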