🤖 AI Summary
This work addresses the inefficiency of static resource allocation in molecular dynamics simulations: because a static allocation cannot adapt to time-varying workloads, it leads to idle resources, queuing delays, and higher node-hour costs. The authors integrate the Dynamic Management of Resources (DMR) middleware, which provides MPI process malleability on top of the Slurm workload manager, into GROMACS. The resulting malleable variant adjusts its MPI process count at runtime by combining a communication-efficiency-aware reconfiguration strategy with GROMACS' native checkpoint/restart mechanism. Experiments on the MareNostrum 5 supercomputer compare dynamic runs against static executions and quantify reconfiguration overhead, time-to-solution, and node-hour savings for bursty GROMACS workloads.
📝 Abstract
Static resource allocations in high-performance computing (HPC) lead to inefficiencies for time-varying workloads, causing idle resources, queue delays, and higher node-hour costs. The Dynamic Management of Resources (DMR) middleware enables MPI process malleability in Slurm via a simple API decoupled from scheduler internals. In this work, we integrate DMR into the GROMACS molecular dynamics engine to obtain a malleable variant that can dynamically adapt its MPI process count by combining communication-efficiency-aware reconfiguration with GROMACS' native checkpoint/restart mechanism. We evaluate this design on the MareNostrum 5 supercomputer, comparing dynamic runs against static executions and quantifying reconfiguration overheads, time-to-solution, and node-hour savings for bursty GROMACS workloads.
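To make the idea of communication-efficiency-aware reconfiguration concrete, the following is a minimal sketch of one possible decision rule, not the DMR API or the policy used in this work. It assumes that efficiency is defined as the fraction of wall time spent computing rather than communicating, that the process count is halved or doubled when efficiency crosses a threshold, and that the thresholds `low` and `high` take the illustrative values shown. All names (`StepTimings`, `next_process_count`, `node_hours`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepTimings:
    """Timings measured over one sampling window (hypothetical structure)."""
    compute_s: float  # seconds spent in computation
    comm_s: float     # seconds spent in MPI communication


def communication_efficiency(t: StepTimings) -> float:
    """Fraction of the window spent computing (assumed definition)."""
    total = t.compute_s + t.comm_s
    return t.compute_s / total if total > 0 else 1.0


def next_process_count(current: int, t: StepTimings,
                       min_procs: int, max_procs: int,
                       low: float = 0.6, high: float = 0.85) -> int:
    """Pick the MPI process count to use after the next checkpoint.

    Shrink when communication dominates, expand when the run still
    scales well, otherwise keep the current size. Thresholds and the
    halve/double step are illustrative assumptions.
    """
    eff = communication_efficiency(t)
    if eff < low:
        target = current // 2
    elif eff > high:
        target = current * 2
    else:
        target = current
    # Clamp to the range the job is allowed to use.
    return max(min_procs, min(max_procs, target))


def node_hours(nodes: int, seconds: float) -> float:
    """Node-hours consumed by `nodes` nodes held for `seconds` seconds."""
    return nodes * seconds / 3600.0
```

In a malleable run, a change in the returned count would trigger a checkpoint, a resize of the allocation, and a restart from that checkpoint with the new process count; those steps depend on the middleware and scheduler and are deliberately left out of this sketch.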