Empirically-Calibrated H100 Node Power Models for Reducing Uncertainty in AI Training Energy Estimation

📅 2025-06-17

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

Existing AI training energy estimation methods rely heavily on manufacturer-specified thermal design power (TDP), leading to substantial inaccuracies (27–37% error) due to unaccounted hardware-level power dynamics and architectural heterogeneity. Method: This work develops an architecture-aware, computation-intensity-driven statistical power model grounded in empirical power measurements from an eight-GPU NVIDIA H100 node and open-source benchmarks. It introduces floating-point operations (FLOPs) as a calibration factor and incorporates explicit architecture classification (e.g., Transformer vs. CNN) to capture divergent dynamic power behaviors. Contribution/Results: We empirically demonstrate that H100 training power consumption reaches only 76% of its TDP—first such quantification—and reveal pronounced power-profile disparities between Transformer- and CNN-based workloads. Our model achieves a mean absolute percentage error of 11.4%, more than doubling the accuracy of TDP-based estimation. Furthermore, it enables quantitative assessment of grid power fluctuation risks induced by Transformer workloads, providing a robust metrological foundation for green AI infrastructure planning and environmental impact evaluation.

Technology Category

Application Category

📝 Abstract

As AI's energy demand continues to grow, it is critical to enhance the understanding of characteristics of this demand, to improve grid infrastructure planning and environmental assessment. By combining empirical measurements from Brookhaven National Laboratory during AI training on 8-GPU H100 systems with open-source benchmarking data, we develop statistical models relating computational intensity to node-level power consumption. We measure the gap between manufacturer-rated thermal design power (TDP) and actual power demand during AI training. Our analysis reveals that even computationally intensive workloads operate at only 76% of the 10.2 kW TDP rating. Our architecture-specific model, calibrated to floating-point operations, predicts energy consumption with 11.4% mean absolute percentage error, significantly outperforming TDP-based approaches (27-37% error). We identified distinct power signatures between transformer and CNN architectures, with transformers showing characteristic fluctuations that may impact grid stability.

Problem

Research questions and friction points this paper is trying to address.

Develops power models for H100 nodes to reduce AI energy estimation uncertainty

Measures gap between manufacturer-rated TDP and actual AI training power demand

Identifies distinct power signatures between transformer and CNN architectures

Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical H100 power models reduce estimation errors

Calibrated models outperform TDP-based energy predictions

Transformer and CNN architectures show distinct power signatures

🔎 Similar Papers

No similar papers found.