Resource optimization with MPI process malleability for dynamic workloads in HPC clusters

📅 2025-06-01
🏛️ Future Generation Computer Systems
🤖 AI Summary
To address low resource utilization, inflexible scheduling, and lack of elasticity for MPI jobs under dynamic workloads in HPC clusters, this paper proposes a runtime optimization mechanism that integrates MPI process malleability directly into the resource scheduling layer. The work extends MPICH with a lightweight process remapping protocol, co-designs a load-aware scheduler with a communication-topology-preserving migration algorithm, and integrates the system natively into Slurm. Together these enable fine-grained, low-overhead process scaling and cross-node migration, removing the fixed process count imposed by traditional static MPI models. Evaluation on a production HPC cluster reports an average 37% improvement in resource utilization, a 29% reduction in job completion time, and communication interruptions under 50 ms, outperforming both static allocation and state-of-the-art elastic MPI solutions.

Problem

Research questions and friction points this paper is trying to address.

Optimizing MPI process malleability for dynamic HPC workloads
Enhancing resource utilization via modular dynamic allocation framework
Reducing memory overhead with advanced MPI reconfiguration strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular DMR framework for dynamic resource allocation
Integrates Proteo's MaM with new spawning strategies
MPI communicator merging reduces memory overhead
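
The bullets above do not spell out how spawning and communicator merging interact, so the following is only a rough pure-Python model of the rank bookkeeping, not the paper's implementation. It assumes an expansion behaves like `MPI_Comm_spawn` followed by `MPI_Intercomm_merge`, with the original processes passing `high = 0` and the spawned ones `high = 1`, so the original group is ordered first. The `Proc` class and the `merge`, `expand`, `shrink`, and `ranks` helpers are hypothetical names introduced for illustration; no real MPI calls are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """One process in the model: an id and the node it runs on."""
    pid: int
    node: str


def merge(parents, children):
    """Model the rank order of a merged intracommunicator.

    Assumption: parents merge with high=0 and children with high=1,
    so the parent group comes first and each group keeps its order.
    """
    return list(parents) + list(children)


def expand(world, new_nodes, next_pid):
    """Grow the job: 'spawn' one process per entry in new_nodes, then merge."""
    children = [Proc(next_pid + i, node) for i, node in enumerate(new_nodes)]
    return merge(world, children)


def shrink(world, released_nodes):
    """Drop processes on released nodes; survivors keep their relative order."""
    released = set(released_nodes)
    return [p for p in world if p.node not in released]


def ranks(world):
    """Map process id -> rank in the current (modelled) communicator."""
    return {p.pid: rank for rank, p in enumerate(world)}


# Four processes on two nodes, expanded onto node n2, then node n1 is released.
world = [Proc(0, "n0"), Proc(1, "n0"), Proc(2, "n1"), Proc(3, "n1")]
grown = expand(world, ["n2", "n2"], next_pid=4)
shrunk = shrink(grown, ["n1"])

print(ranks(grown))
print(ranks(shrunk))
```

In this model the original processes keep their ranks across an expansion, while a shrink renumbers the survivors densely. A real malleable runtime would also have to redistribute application data after either step, which the sketch leaves out.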
Sergio Iserte
Senior Researcher @ BSC
HPC, Resource Management, Heterogeneous Computing, AI for Scientific Computing
Iker Martín-Álvarez
Universitat Jaume I (UJI), Castelló de la Plana, Spain
Krzysztof Rojek
DSc, PhD, CTO @ byteLAKE, Professor @ Czestochowa University of Technology
parallel computing, GPU, FPGA, Energy-aware computing, machine learning
J. I. Aliaga
Universitat Jaume I (UJI), Castelló de la Plana, Spain
Maribel Castillo
Universitat Jaume I (UJI), Castelló de la Plana, Spain
Weronika Folwarska
Department of Computer Science, Częstochowa University of Technology (PCZ), Częstochowa, Poland
Antonio J. Peña
Barcelona Supercomputing Center (BSC)
HPC runtime systems, HPC communications, heterogeneous computing, parallel and distributed computing