Towards Energy Efficient Co-Scheduling in HPC

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses inefficiencies in multi-GPU systems caused by suboptimal GPU allocation, which often lead to wasted energy and poor resource utilization. The authors propose EcoSched, an online scheduler that combines dynamic GPU count selection with application co-scheduling. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, a score-based policy that balances energy efficiency against idle resources, and NUMA-aware task placement to mitigate interference. Evaluations on H100, A100, and V100 platforms show that, relative to baseline schedulers, EcoSched reduces energy consumption by up to 14.8%, improves makespan by up to 30.1%, and lowers the energy-delay product (EDP) by up to 40.4%, with modest performance overhead.

📝 Abstract
Modern multi-GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application co-scheduling to improve workload-level efficiency on multi-GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU-GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement, and 40.4% EDP reduction over baseline schedulers, with modest performance overhead. These results show that jointly selecting GPU counts and co-scheduling actions is essential for efficient multi-GPU workload execution.
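
To make the trade-off in the abstract concrete, here is a minimal sketch of how a score-based choice of GPU count might look. It is not EcoSched's actual policy, which this page does not specify. The `Candidate` fields, the `pick_gpu_count` helper, the `idle_weight` parameter, the scoring formula, and all numbers are hypothetical. The only standard definition used is EDP = energy × delay, with energy taken as average power × runtime. The sketch also considers a single job in isolation; a co-scheduler would fill leftover GPUs with other jobs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    gpus: int         # GPU count under consideration
    runtime_s: float  # estimated runtime at that count (hypothetical value)
    power_w: float    # estimated average power at that count (hypothetical value)


def energy_j(c: Candidate) -> float:
    # Energy in joules = average power (W) * runtime (s).
    return c.power_w * c.runtime_s


def edp(c: Candidate) -> float:
    # Energy-delay product = energy * runtime.
    return energy_j(c) * c.runtime_s


def pick_gpu_count(
    candidates: Sequence[Candidate], free_gpus: int, idle_weight: float
) -> Optional[Candidate]:
    """Pick the candidate with the lowest illustrative score (lower is better).

    Assumed score = relative EDP penalty + idle_weight * fraction of GPUs left idle.
    This formula is an assumption for illustration, not the paper's.
    """
    feasible = [c for c in candidates if 0 < c.gpus <= free_gpus]
    if not feasible:
        return None
    best_edp = min(edp(c) for c in feasible)

    def score(c: Candidate) -> float:
        edp_penalty = edp(c) / best_edp - 1.0           # 0 for the most EDP-efficient option
        idle_penalty = (free_gpus - c.gpus) / free_gpus  # 0 when every free GPU is used
        return edp_penalty + idle_weight * idle_penalty

    return min(feasible, key=score)


# Hypothetical profile of one job with nonlinear scaling.
profile = [
    Candidate(gpus=1, runtime_s=100.0, power_w=300.0),
    Candidate(gpus=2, runtime_s=60.0, power_w=550.0),
    Candidate(gpus=4, runtime_s=50.0, power_w=1000.0),
]

efficient = pick_gpu_count(profile, free_gpus=4, idle_weight=0.0)
busy = pick_gpu_count(profile, free_gpus=4, idle_weight=1.0)
print(efficient.gpus, busy.gpus)
```

With these made-up numbers, ignoring idle GPUs favors the 2-GPU configuration because it has the lowest EDP, while weighting idle GPUs heavily pushes the choice to all 4 GPUs. That tension is what the abstract's score-based policy is described as balancing.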
Problem

Research questions and friction points this paper is trying to address.

energy efficiency
GPU co-scheduling
multi-GPU systems
workload efficiency
resource underutilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

co-scheduling
energy efficiency
GPU allocation
NUMA-aware placement
runtime profiling