Towards Energy Efficient Co-Scheduling in HPC

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses inefficiency in multi-GPU HPC systems caused by suboptimal GPU allocation, which wastes energy and leaves resources underutilized. The authors propose EcoSched, an online scheduler that combines dynamic GPU count selection with application co-scheduling. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, a score-based policy that balances energy efficiency against idle resources, and NUMA-aware task placement to mitigate interference. Evaluations on H100, A100, and V100 platforms show that, relative to baseline schedulers, EcoSched reduces energy consumption by up to 14.8%, shortens makespan by up to 30.1%, and lowers the energy-delay product (EDP) by up to 40.4%, improving both energy efficiency and resource utilization on heterogeneous CPU-GPU systems.
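As a quick sanity check on how the three headline figures relate, the sketch below uses the standard definition of EDP as energy multiplied by delay. It assumes, purely for illustration, that the maximum energy and delay savings occur on the same workload; the source reports each figure as an independent "up to" value and does not state this. The function name `edp_reduction` is hypothetical.

```python
def edp_reduction(energy_saving: float, delay_saving: float) -> float:
    """Fractional EDP reduction, given fractional energy and delay savings.

    EDP = energy * delay, so the remaining EDP is the product of the
    remaining energy fraction and the remaining delay fraction.
    """
    remaining_energy = 1.0 - energy_saving
    remaining_delay = 1.0 - delay_saving
    return 1.0 - remaining_energy * remaining_delay


# Assumption: both maxima come from the same run (not stated in the source).
reduction = edp_reduction(0.148, 0.301)
print(f"{reduction:.1%}")  # prints "40.4%"
```

Under that assumption the product of the two savings reproduces the reported 40.4% EDP reduction, which suggests the three numbers are mutually consistent.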

📝 Abstract

Modern multi-GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application co-scheduling to improve workload-level efficiency on multi-GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU-GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers, with modest performance overhead. These results show that jointly selecting GPU counts and co-scheduling actions is essential for efficient multi-GPU workload execution.
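To make the idea of a score-based policy concrete, here is a minimal sketch of how a GPU count might be chosen by trading energy efficiency against idle resources. It is not EcoSched's actual scoring function, which the abstract does not specify. The `Candidate` fields, the weight `alpha`, the linear scoring formula, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    gpus: int            # GPU count under consideration
    rel_runtime: float   # runtime relative to the 1-GPU run (hypothetical profile)
    avg_power_w: float   # estimated average power at this count (hypothetical)

    @property
    def energy(self) -> float:
        # Relative energy: runtime multiplied by average power.
        return self.rel_runtime * self.avg_power_w


def pick_gpu_count(candidates: List[Candidate], free_gpus: int,
                   alpha: float = 0.5) -> Optional[int]:
    """Return the GPU count with the highest score, or None if nothing fits.

    Assumed score: alpha * energy term + (1 - alpha) * utilization term.
    """
    feasible = [c for c in candidates if c.gpus <= free_gpus]
    if not feasible:
        return None
    best_energy = min(c.energy for c in feasible)

    def score(c: Candidate) -> float:
        energy_term = best_energy / c.energy   # 1.0 for the most efficient option
        util_term = c.gpus / free_gpus         # share of idle GPUs put to work
        return alpha * energy_term + (1.0 - alpha) * util_term

    return max(feasible, key=score).gpus


# Made-up profile showing nonlinear scaling: 4 GPUs are fastest but cost the most energy.
profile = [
    Candidate(gpus=1, rel_runtime=1.00, avg_power_w=300.0),
    Candidate(gpus=2, rel_runtime=0.52, avg_power_w=570.0),
    Candidate(gpus=4, rel_runtime=0.40, avg_power_w=1100.0),
]
print(pick_gpu_count(profile, free_gpus=4, alpha=0.7))  # 2 with these made-up numbers
```

With this hypothetical profile, weighting energy at 0.7 selects two GPUs, which leaves two free for a co-scheduled job, while setting `alpha` to 0 degenerates to the "always use all available GPUs" behavior the abstract argues against.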

Problem

Research questions and friction points this paper is trying to address.

energy efficiency

GPU co-scheduling

multi-GPU systems

workload efficiency

resource underutilization

Innovation

Methods, ideas, or system contributions that make the work stand out.

co-scheduling

energy efficiency

GPU allocation