Machine Learning Guided Cooling Optimization for Data Centers

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the pervasive inefficiency of data center cooling systems, which leads to substantial energy waste. To tackle this challenge, the authors propose a three-stage, physics-informed machine learning framework that integrates physical constraints with data-driven modeling to enable interpretable, safe, and deployable cooling optimization. The approach combines a monotonicity-constrained gradient boosting surrogate, a physically consistent baseline, and a safeguarded counterfactual adjustment mechanism, supporting both counterfactual analysis and integration with model predictive control. Experimental results demonstrate a mean absolute prediction error of 0.026 MW and PUE predictions within 0.01 of measured values for 98.7% of samples. The framework identifies approximately 85 MWh of annual cooling energy waste and recovers up to 96% of this excess consumption through small, safety-guaranteed setpoint adjustments.

๐Ÿ“ Abstract
Effective data center cooling is crucial for reliable operation; however, cooling systems often exhibit inefficiencies that result in excessive energy consumption. This paper presents a three-stage, physics-guided machine learning framework for identifying and reducing cooling energy waste in high-performance computing facilities. Using one year of 10-minute resolution operational data from the Frontier exascale supercomputer, we first train a monotonicity-constrained gradient boosting surrogate that predicts facility accessory power from coolant flow rates, temperatures, and server power. The surrogate achieves a mean absolute error of 0.026 MW and predicts power usage effectiveness within 0.01 of measured values for 98.7% of test samples. In the second stage, the surrogate serves as a physics-consistent baseline to quantify excess cooling energy, revealing approximately 85 MWh of annual inefficiency concentrated in specific months, hours, and operating regimes. The third stage evaluates guardrail-constrained counterfactual adjustments to supply temperature and subloop flows, demonstrating that up to 96% of identified excess can be recovered through small, safe setpoint changes while respecting thermal limits and operational constraints. The framework yields interpretable recommendations, supports counterfactual analyses such as flow reduction during low-load periods and redistribution of thermal duty across cooling loops, and provides a practical pathway toward quantifiable reductions in accessory power. The developed framework is readily compatible with model predictive control and can be extended to other liquid-cooled data centers with different configurations and cooling requirements.
Problem

Research questions and friction points this paper is trying to address.

data center cooling
energy inefficiency
cooling optimization
high-performance computing
energy waste
Innovation

Methods, ideas, or system contributions that make the work stand out.

physics-guided machine learning
cooling optimization
monotonicity-constrained surrogate
counterfactual analysis
data center energy efficiency
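The guardrail-constrained counterfactual stage can be sketched as a bounded search over small setpoint deltas, keeping only candidates inside a safe operating envelope. The stand-in predictor, bounds, and step sizes below are assumptions; the paper's actual constraints and surrogate would take their place.

```python
# Minimal sketch of guardrail-constrained counterfactual adjustment: search
# small setpoint changes within safety bounds and keep the one that minimizes
# predicted accessory power. The predictor and limits are illustrative
# assumptions, not the paper's model or thermal constraints.
import itertools

def predicted_accessory_power(flow, t_supply):
    # Stand-in surrogate: power rises with flow, falls with supply temperature.
    return 0.8 * flow + 0.05 * (30.0 - t_supply)

def best_counterfactual(flow0, t0, flow_bounds=(0.6, 1.8), t_bounds=(18.0, 28.0),
                        max_delta_flow=0.1, max_delta_t=1.0, steps=5):
    """Grid-search small setpoint deltas, rejecting any that leave the
    assumed safe operating envelope (the 'guardrails')."""
    best = (flow0, t0, predicted_accessory_power(flow0, t0))
    deltas_f = [max_delta_flow * i / steps for i in range(-steps, steps + 1)]
    deltas_t = [max_delta_t * i / steps for i in range(-steps, steps + 1)]
    for df, dt in itertools.product(deltas_f, deltas_t):
        f, t = flow0 + df, t0 + dt
        # Guardrail: candidate setpoints must stay inside the safe envelope.
        if not (flow_bounds[0] <= f <= flow_bounds[1]
                and t_bounds[0] <= t <= t_bounds[1]):
            continue
        p = predicted_accessory_power(f, t)
        if p < best[2]:
            best = (f, t, p)
    return best

f_opt, t_opt, p_opt = best_counterfactual(1.2, 22.0)
# The adjusted setpoint should never predict more power than the current one.
assert p_opt <= predicted_accessory_power(1.2, 22.0)
```

Because only small deltas are considered and every candidate is screened against the bounds before evaluation, the recommended adjustment is conservative by construction, which mirrors the "small, safe setpoint changes" framing in the abstract.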