🤖 AI Summary
This study addresses the lack of a systematic method for allocating coolant distribution units (CDUs) and coolant flow across the parallel subloops of liquid-cooled supercomputing data centers. The authors propose a three-layer coordinated optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow-fraction allocation, and the supply temperature and total flow rate at each timestep, all subject to per-subloop thermal safety constraints. Flow-fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing design sensitivity to the CDU partition by 93% and enabling near-optimal energy efficiency through software-only adjustments on existing hardware. Using a Modelica simulation model built from Frontier supercomputer data and a reduced-order surrogate model, the approach evaluates all 611 feasible partitions of 25 CDUs across 49,353 timesteps covering a full year of operation. The globally optimal two-subloop design reduces annual cooling energy consumption by 35.48%, only 0.18 percentage points above the current three-subloop Frontier design at 35.30%. The authors state that the framework is transferable to other liquid-cooled high-performance computing plants.
📝 Abstract
Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow-fraction allocation, and the per-timestep co-optimization of total flow rate and supply temperature, subject to per-subloop thermal safety constraints. The Modelica simulation model is built from data from the Frontier exascale supercomputer at Oak Ridge National Laboratory. Using a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full-year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow-control optimization alone to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18 percentage points above the current three-subloop Frontier design at 35.30%. Flow-fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost, software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.
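The outermost layer of the framework searches over integer partitions of CDUs across subloops. The sketch below illustrates what such a search space might look like, assuming CDUs are interchangeable and subloops are unordered. The function name `partitions` and its parameters are hypothetical, and the feasibility rules that lead to the paper's 611 candidates are not stated in the abstract, so they are not reproduced here.

```python
from typing import Iterator, Optional, Tuple


def partitions(n: int, k: int, max_part: Optional[int] = None) -> Iterator[Tuple[int, ...]]:
    """Yield every split of n identical units into exactly k non-empty groups.

    Each split is a non-increasing tuple, so (4, 1) and (1, 4) count once.
    Hypothetical helper: no thermal or hydraulic feasibility check is applied.
    """
    if max_part is None:
        max_part = n
    if k == 0:
        # A valid split is complete only when no units are left over.
        if n == 0:
            yield ()
        return
    # The largest group must hold at least ceil(n / k) units (and at least 1),
    # and must leave one unit for each of the remaining k - 1 groups.
    lo = max(1, -(-n // k))
    hi = min(max_part, n - (k - 1))
    for first in range(hi, lo - 1, -1):
        for rest in partitions(n - first, k - 1, first):
            yield (first,) + rest


if __name__ == "__main__":
    # Enumerate all ways to assign 25 CDUs to two subloops.
    for split in partitions(25, 2):
        print(split)
```

Under these assumptions, `partitions(25, 2)` yields the 12 two-subloop splits from `(24, 1)` to `(13, 12)`. In a full study, each candidate would then be passed to the inner layers (flow-fraction allocation and per-timestep flow rate and supply temperature optimization) before being ranked.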