A Test Taxonomy and Continuous Integration Ecosystem for Dynamic Resource Management in HPC

📅 2026-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic, reproducible, and maintainable testing methodologies for dynamic resource management libraries. It proposes an automated validation framework for high-performance computing (HPC) environments, built around a multi-level test taxonomy that covers functional and non-functional requirements at the component-integration and system levels. The framework targets malleable MPI libraries, exercises the core primitives of dynamic resource management (initialization, readiness checks, and reconfiguration), and runs the resulting tests in a containerized virtual cluster inside a continuous integration (CI) ecosystem. A case study on the Dynamic Management of Resources (DMR) framework indicates that the approach improves early fault detection, reduces maintenance overhead caused by evolving dependencies, and transfers to other malleability solutions that expose analogous primitives.
📝 Abstract
High-performance computing (HPC) systems are increasingly exploring dynamic resource management and malleable MPI applications to better adapt to heterogeneous architectures, fluctuating workloads, and energy constraints. However, the correctness of the libraries that support these techniques is often evaluated through ad hoc experiments that are difficult to reproduce and maintain. This article introduces a methodology for testing dynamic resource management frameworks that combines a taxonomy of tests for malleable MPI libraries with an HPC-oriented continuous integration (CI) ecosystem. The taxonomy structures functional and non-functional tests at both the component-integration and system levels. The CI ecosystem instantiates this taxonomy in a containerized virtual cluster, enabling automated validation. The approach is instantiated and evaluated using the Dynamic Management of Resources (DMR) framework as a representative case study. Results show that the proposed methodology improves early fault detection, simplifies maintenance under evolving dependencies, and transfers to other malleability solutions that expose analogous primitives for initialization, readiness checking, and reconfiguration.
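To make the two taxonomy axes concrete, the following is a minimal Python sketch, not the paper's implementation. It tags each test by kind (functional or non-functional) and by level (component-integration or system), and runs the tests against a toy object exposing the three primitives named in the abstract. `ToyMalleableJob`, `Case`, `run_suite`, and all method names are hypothetical and do not reflect the DMR API; a real suite would launch MPI processes inside the containerized cluster rather than use an in-memory stand-in.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Kind(Enum):
    FUNCTIONAL = "functional"
    NON_FUNCTIONAL = "non-functional"


class Level(Enum):
    COMPONENT_INTEGRATION = "component-integration"
    SYSTEM = "system"


class ToyMalleableJob:
    """Hypothetical in-memory stand-in for a malleability library (not DMR)."""

    def __init__(self) -> None:
        self.initialized = False
        self.nprocs = 0

    def init(self, nprocs: int) -> None:
        if nprocs < 1:
            raise ValueError("nprocs must be positive")
        self.initialized = True
        self.nprocs = nprocs

    def ready(self) -> bool:
        return self.initialized

    def reconfigure(self, new_nprocs: int) -> int:
        if not self.ready():
            raise RuntimeError("reconfigure called before init")
        if new_nprocs < 1:
            raise ValueError("new_nprocs must be positive")
        self.nprocs = new_nprocs
        return self.nprocs


@dataclass(frozen=True)
class Case:
    name: str
    kind: Kind
    level: Level
    run: Callable[[], bool]


def check_reconfigure_requires_init() -> bool:
    # Functional: reconfiguration must be rejected before initialization.
    job = ToyMalleableJob()
    try:
        job.reconfigure(4)
    except RuntimeError:
        return True
    return False


def check_expand() -> bool:
    # Functional: init, readiness check, then an expansion from 2 to 4.
    job = ToyMalleableJob()
    job.init(2)
    return job.ready() and job.reconfigure(4) == 4


def check_repeated_resizes() -> bool:
    # Non-functional (robustness): state stays consistent across many resizes.
    job = ToyMalleableJob()
    job.init(1)
    for n in (2, 4, 8, 4, 2, 1):
        job.reconfigure(n)
    return job.nprocs == 1


SUITE: List[Case] = [
    Case("reconfigure_requires_init", Kind.FUNCTIONAL,
         Level.COMPONENT_INTEGRATION, check_reconfigure_requires_init),
    Case("expand", Kind.FUNCTIONAL, Level.SYSTEM, check_expand),
    Case("repeated_resizes", Kind.NON_FUNCTIONAL, Level.SYSTEM,
         check_repeated_resizes),
]


def run_suite(suite: List[Case]) -> Dict[Tuple[str, str], List[Tuple[str, bool]]]:
    """Run every case and group (name, passed) pairs by taxonomy cell."""
    results: Dict[Tuple[str, str], List[Tuple[str, bool]]] = {}
    for case in suite:
        key = (case.kind.value, case.level.value)
        results.setdefault(key, []).append((case.name, case.run()))
    return results


if __name__ == "__main__":
    for cell, outcomes in run_suite(SUITE).items():
        print(cell, outcomes)
```

Grouping results by taxonomy cell lets a CI job report which part of the taxonomy a failure belongs to. The grouping itself is an illustrative design choice, not something the abstract prescribes.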
Problem

Research questions and friction points this paper is trying to address.

dynamic resource management
malleable MPI
test reproducibility
HPC
library correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

test taxonomy
continuous integration
dynamic resource management
malleable MPI
HPC