AI Summary
To address low scheduling accuracy for scientific and data-intensive workflows and insufficient dynamic resource management in high-performance computing (HPC), this paper designs and implements the first scalable job scheduling and resource management component tailored for the Structural Simulation Toolkit (SST). The component integrates a workflow management module into SST, which enables task dependency modeling, dynamic resource allocation, and comparative simulation of multiple scheduling algorithms (e.g., Shortest Job First, Fair Share). It also incorporates resource reservation policies and a parallel event-processing mechanism. Experimental evaluation reports errors below 5% in job waiting time and below 3% in node utilization across diverse workloads, along with scalability to thousands of cores. These results improve the fidelity with which cycle-accurate simulation captures real-world scheduler behavior.
Abstract
Efficient job scheduling and resource management contribute to maximizing throughput and efficiency in high-performance computing (HPC) systems. In this paper, we introduce a scalable job scheduling and resource management component within the Structural Simulation Toolkit (SST), a cycle-accurate, parallel discrete-event simulator. Our proposed simulator includes state-of-the-art job scheduling algorithms and resource management techniques. Additionally, it introduces a workflow management component that supports the simulation of task dependencies and resource allocations, which are crucial for the workflows typical of scientific computing and data-intensive applications. We present validation and scalability results for our job scheduling simulator. The results show that the simulator achieves good accuracy on several metrics (e.g., job wait times, node usage) as well as good parallel performance.
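To make the idea of scheduling a workflow with task dependencies concrete, the following is a minimal sketch of a Shortest Job First policy on a fixed pool of nodes. It is an illustration only and does not use the SST API or reproduce the paper's component. The function name `simulate_sjf`, the job tuple layout, and the example workload are all assumptions introduced here. The sketch also assumes every job is submitted at time 0 and that the dependency graph has no cycles.

```python
import heapq

def simulate_sjf(jobs, deps, total_nodes):
    """Hypothetical sketch: strict Shortest Job First with task dependencies.

    jobs: {name: (nodes_needed, runtime)}
    deps: {name: iterable of prerequisite job names}
    Returns {name: (start_time, end_time)}.
    """
    # Prerequisites that have not finished yet, per job.
    remaining = {j: set(deps.get(j, ())) for j in jobs}
    ready = [j for j in jobs if not remaining[j]]
    running = []          # min-heap of (end_time, name)
    free = total_nodes
    now = 0
    schedule = {}

    while ready or running:
        # Shortest runtime first; the job name breaks ties deterministically.
        ready.sort(key=lambda j: (jobs[j][1], j))
        # Strict SJF: stop at the first job that does not fit (no backfilling).
        while ready and jobs[ready[0]][0] <= free:
            j = ready.pop(0)
            nodes, runtime = jobs[j]
            free -= nodes
            schedule[j] = (now, now + runtime)
            heapq.heappush(running, (now + runtime, j))

        if not running:
            raise ValueError("a ready job needs more nodes than the system has")

        # Advance simulated time to the next job completion.
        now, done = heapq.heappop(running)
        free += jobs[done][0]
        # Release jobs whose last prerequisite just finished.
        for j in jobs:
            if done in remaining[j]:
                remaining[j].discard(done)
                if not remaining[j]:
                    ready.append(j)
    return schedule

# Example workload (made up for illustration): C depends on A and B.
example_jobs = {"A": (2, 10), "B": (2, 5), "C": (4, 3), "D": (2, 1)}
example_deps = {"C": ["A", "B"]}
example_schedule = simulate_sjf(example_jobs, example_deps, total_nodes=4)
# Wait time of each job is its start time, since all jobs are submitted at 0.
example_waits = {name: start for name, (start, _end) in example_schedule.items()}
```

Metrics such as job wait time and node usage, which the abstract names as validation targets, can be read off a schedule like this one. Swapping the sort key is enough to compare a different ordering policy on the same workload.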