More for Less: Integrating Capability-Predominant and Capacity-Predominant Computing

📅 2025-01-21
🤖 AI Summary
This study addresses the low resource utilization and high operating costs that arise in high-performance computing (HPC) when capability workloads (large, long-running jobs) and capacity workloads (small, short-running jobs) are scheduled on separate systems. We propose scheduling both workload types cooperatively on a unified system. Using real job traces from U.S. Department of Energy (DOE) production systems, we combine workload characterization, trace-driven event-based simulation, and two complementary analyses (workload fusion and workload injection) to validate empirically, for the first time, the feasibility of unified scheduling. Results show that unified scheduling improves resource utilization on capability systems by up to 37% and reduces the total operating cost of the compute infrastructure. This work challenges the conventional HPC practice of deploying siloed systems by job type and provides empirical evidence for next-generation elastic, cost-efficient HPC architectures.

📝 Abstract
Capability jobs (e.g., large, long-running tasks) and capacity jobs (e.g., small, short-running tasks) are two common types of workloads in high-performance computing (HPC). Different HPC systems are typically deployed to handle distinct computing workloads. For example, Theta at the Argonne Leadership Computing Facility (ALCF) primarily serves capability jobs, while Cori at the National Energy Research Scientific Computing Center (NERSC) predominantly handles capacity workloads. However, this segregation often leads to inefficient resource utilization and higher costs due to the need for operating separate computing platforms. This work examines what-if scenarios for integrating siloed platforms. Specifically, we collect and characterize two real workloads from production systems at DOE laboratories, representing capability-predominant and capacity-predominant computing, respectively. We investigate two approaches to unification. Workload fusion explores how efficiently resources are utilized when a unified system accommodates diverse workloads, whereas workload injection identifies opportunities to enhance resource utilization on capability computing systems by leveraging capacity jobs. Finally, through extensive trace-based, event-driven simulations, we explore the potential benefits of co-scheduling both types of jobs on a unified system to enhance resource utilization and reduce costs, offering new insights for future research in unified computing.
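
The idea of measuring utilization with a trace-based, event-driven simulation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Job` record, the `simulate_fcfs` function, and the strict first-come-first-served policy without backfilling are hypothetical stand-ins, not the scheduler or simulator used in the paper, and the toy jobs are invented rather than taken from the Theta or Cori traces.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    submit: float   # submission time, in hours
    nodes: int      # number of nodes requested
    runtime: float  # actual runtime, in hours


def simulate_fcfs(jobs, total_nodes):
    """Replay a job trace under strict FCFS (no backfilling).

    Returns (utilization, makespan), where utilization is the fraction
    of node-hours in [0, makespan] that were occupied by jobs.
    """
    running = []            # min-heap of (end_time, nodes) for running jobs
    free = total_nodes
    now = 0.0
    busy_node_hours = 0.0
    makespan = 0.0

    # sorted() is stable, so jobs with equal submit times keep trace order.
    for job in sorted(jobs, key=lambda j: j.submit):
        if job.nodes > total_nodes:
            raise ValueError("job requests more nodes than the system has")
        now = max(now, job.submit)
        # Release nodes of every job that has finished by the current time.
        while running and running[0][0] <= now:
            _, released = heapq.heappop(running)
            free += released
        # If the head-of-queue job does not fit, advance time to the next
        # completion events until enough nodes are free.
        while free < job.nodes:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        free -= job.nodes
        end = now + job.runtime
        heapq.heappush(running, (end, job.nodes))
        busy_node_hours += job.nodes * job.runtime
        makespan = max(makespan, end)

    utilization = busy_node_hours / (total_nodes * makespan) if makespan else 0.0
    return utilization, makespan


# Toy traces (invented values): two large jobs leave 20 of 100 nodes idle,
# and two small jobs are sized to fit exactly into that gap.
capability = [Job(0, 80, 10), Job(10, 80, 10)]
capacity = [Job(0, 20, 10), Job(10, 20, 10)]

util_alone, _ = simulate_fcfs(capability, total_nodes=100)
util_fused, _ = simulate_fcfs(capability + capacity, total_nodes=100)
print(f"capability only: {util_alone:.2f}, with capacity jobs: {util_fused:.2f}")
```

In this toy setup the small jobs occupy nodes that the large jobs leave idle, so utilization rises without lengthening the schedule. With a less favorable arrival order, strict FCFS can instead delay jobs, which is one reason a realistic study needs real traces and a production-like scheduling policy.
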
Problem

Research questions and friction points this paper is trying to address.

High-Performance Computing
Resource Optimization
Task Allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated Computing
Efficiency Improvement
Unified Computational Architecture