🤖 AI Summary
Surging AI workloads are intensifying power consumption and thermal management challenges in data centers. Method: We systematically compare liquid cooling and air cooling during training of large language models (LLMs) and vision-language models (VLMs) on two HGX nodes, each with eight NVIDIA H100 GPUs. Using GPU Burn, IPMItool, and Weights & Biases, we concurrently measure GPU temperature, power draw, and computational throughput. Contribution/Results: Liquid cooling holds GPU temperatures at 41–50°C, compared with 54–72°C under air cooling, and reaches a measured throughput of 54 TFLOPs per GPU, 17% higher than the 46 TFLOPs per GPU measured with air cooling, while also improving performance per watt (lower energy per unit compute). To our knowledge, this is the first empirical, workload-driven quantification of liquid cooling’s performance and energy-efficiency benefits for H100 GPUs under realistic LLM/VLM training workloads. The findings support liquid cooling as a practical, sustainable thermal infrastructure option for hyperscale AI data centers.
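The headline percentage can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the study's analysis code: the function names `relative_gain` and `joules_per_tflop` are hypothetical, and the throughput inputs are the per-GPU figures reported in the abstract (54 and 46 TFLOPs). The source gives no power figures, so the energy helper is shown without real inputs.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Fractional improvement of `new` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


def joules_per_tflop(avg_power_watts: float, throughput_tflops: float) -> float:
    """Energy per 10^12 floating-point operations.

    Watts are joules per second and the throughput is 10^12 operations
    per second, so their ratio is joules per 10^12 operations.
    Power values are NOT reported in the source; supply your own.
    """
    if throughput_tflops <= 0:
        raise ValueError("throughput must be positive")
    return avg_power_watts / throughput_tflops


# Per-GPU throughput figures as reported in the abstract.
LIQUID_TFLOPS = 54.0
AIR_TFLOPS = 46.0

gain = relative_gain(LIQUID_TFLOPS, AIR_TFLOPS)
print(f"Liquid vs. air throughput gain: {gain:.1%}")  # prints 17.4%
```

The result, (54 - 46) / 46 ≈ 17.4%, matches the reported 17% after rounding. With measured average power per GPU, `joules_per_tflop` would express the "energy per unit compute" comparison in the same way.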
📝 Abstract
The unprecedented growth in artificial intelligence (AI) workloads, recently dominated by large language models (LLMs) and vision-language models (VLMs), has intensified power and cooling demands in data centers. This study benchmarks LLMs and VLMs on two HGX nodes, each with eight NVIDIA H100 graphics processing units (GPUs), using liquid and air cooling. Using GPU Burn, Weights & Biases, and IPMItool, we collect detailed thermal, power, and computation data. Results show that the liquid-cooled systems maintain GPU temperatures between 41 and 50 degrees Celsius, while the air-cooled counterparts fluctuate between 54 and 72 degrees Celsius under load. This thermal stability of liquid-cooled systems yields 17 percent higher performance (54 TFLOPs per GPU vs. 46 TFLOPs per GPU), improved performance per watt, reduced energy overhead, and greater system efficiency than the air-cooled counterparts. These findings underscore the energy and sustainability benefits of liquid cooling, offering a compelling path forward for hyperscale data centers s