🤖 AI Summary
This work addresses the thermal challenges of high-power heterogeneous multi-chip packages, such as NVIDIA’s GB200, by proposing a parameterizable interdigitated microchannel cooling architecture. A thermofluidic coupling framework is built from a porous-medium model of coolant transport combined with a row-level coolant energy balance. The design adds a cooling-coverage constraint targeting the high-heat-flux GPU regions and uses a surrogate model with a mixed-integer quadratic programming (MIQP) algorithm to optimize the channel geometric parameters efficiently. Evaluated on a GB200-based multi-chip configuration, the optimized design lowers the peak chip temperature by 140.45 °C and the average chip temperature by 35.87 °C relative to the baseline design, balancing accuracy against computational cost.
📝 Abstract
Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip. In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous-media representation of coolant transport and a row-wise coolant energy balance to estimate the temperature fields of chips cooled by microchannel networks. Unlike conventional designs, the interdigitated cooling architecture is parameterized by geometric variables, including channel count, width, and expansion over chip regions, enabling systematic design exploration. To make optimization efficient, a surrogate model approximates the relationship between the geometric parameters and the temperature metrics. The surrogate is then optimized with a mixed-integer quadratic programming (MIQP) algorithm to minimize a weighted objective combining peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near the GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on the NVIDIA GB200 architecture, consisting of two graphics processing units (GPUs) and one central processing unit (CPU). The results show that the optimal design reduces the peak chip temperature by 140.45 °C and the average chip temperature by 35.87 °C compared to the baseline configuration.
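The row-wise coolant energy balance mentioned above can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: the coolant passes the rows in sequence, and the heat absorbed in each row raises its bulk temperature by Q / (ṁ·c_p). The function name, its parameters, and the default specific heat (roughly that of liquid water) are illustrative assumptions and are not taken from the paper.

```python
def coolant_row_temperatures(row_heat_w, inlet_temp_c, mass_flow_kg_s,
                             cp_j_per_kg_k=4180.0):
    """Return the coolant bulk temperature at the inlet and after each row.

    Hypothetical helper: applies the steady-state energy balance
    T_out = T_in + Q_row / (m_dot * c_p) row by row. The default c_p is an
    assumed value close to liquid water, not a figure from the paper.
    """
    if mass_flow_kg_s <= 0 or cp_j_per_kg_k <= 0:
        raise ValueError("mass flow rate and specific heat must be positive")

    heat_capacity_rate = mass_flow_kg_s * cp_j_per_kg_k  # W/K
    temps = [inlet_temp_c]
    for q_row in row_heat_w:
        # Heat absorbed in this row raises the coolant temperature
        # that the next row sees as its inlet condition.
        temps.append(temps[-1] + q_row / heat_capacity_rate)
    return temps
```

Calling `coolant_row_temperatures([418.0, 836.0], 25.0, 0.01)` returns three values: the inlet temperature followed by the coolant temperature after each of the two rows. In a coupled model of the kind the abstract describes, these row temperatures would act as the local sink temperature for the conduction solve. How the paper actually performs that coupling is not specified here.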