🤖 AI Summary
This work addresses carbon-aware resource provisioning and scheduling for batch jobs in cloud computing clusters. Method: We propose the first cluster-level joint optimization framework that integrates carbon intensity time-series forecasting with reinforcement learning to jointly determine online decisions on resource elasticity (scale-in/scale-out) and job start/stop timing, while respecting job delay tolerance and system elasticity. We introduce a continual learning mechanism that dynamically adapts to real-time carbon intensity variations using historical cluster data, and design an extended architecture based on AWS ParallelCluster with job suspension/resumption capabilities. Contribution/Results: Evaluated on publicly available industry workloads, our approach reduces carbon emissions by about 57% compared to a carbon-agnostic baseline and performs within 2.1% of an oracle scheduler that has perfect knowledge of future carbon intensity and job length.
📝 Abstract
Accelerating computing demand, largely from AI applications, has led to concerns about its carbon footprint. Fortunately, a significant fraction of computing demand comes from batch jobs that are often delay-tolerant and elastic, which enables schedulers to reduce carbon by suspending/resuming jobs and scaling their resources down/up when carbon is high/low. However, prior work on carbon-aware scheduling generally focuses on optimizing carbon for individual jobs in the cloud, and not provisioning and scheduling resources for many parallel jobs in cloud clusters. To address the problem, we present CarbonFlex, a carbon-aware resource provisioning and scheduling approach for cloud clusters. CarbonFlex leverages continuous learning over historical cluster-level data to drive near-optimal runtime resource provisioning and job scheduling. We implement CarbonFlex by extending AWS ParallelCluster to include our carbon-aware provisioning and scheduling algorithms. Our evaluation on publicly available industry workloads shows that CarbonFlex decreases carbon emissions by $\sim$57% compared to a carbon-agnostic baseline and performs within 2.1% of an oracle scheduler with perfect knowledge of future carbon intensity and job length.
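To make the suspend/resume idea from the abstract concrete, the following is a minimal sketch for a single delay-tolerant job. It is not CarbonFlex's algorithm: it assumes a perfectly known hourly carbon-intensity trace, a fixed power draw, and free suspend/resume, and it ignores cluster-level provisioning entirely. The function names, the `slack` parameter, and the intensity values are all hypothetical and chosen only for illustration.

```python
from typing import List


def agnostic_emissions(intensity: List[float], arrival: int, length: int,
                       power_kw: float = 1.0) -> float:
    """Carbon-agnostic baseline: run for `length` consecutive hours from arrival."""
    return power_kw * sum(intensity[arrival:arrival + length])


def suspend_resume_schedule(intensity: List[float], arrival: int, length: int,
                            slack: int) -> List[int]:
    """Pick the `length` lowest-carbon hours before the job's deadline.

    The job may be suspended and resumed at hour boundaries (assumed free),
    and must finish within `length + slack` hours of its arrival.
    """
    deadline = min(arrival + length + slack, len(intensity))
    window = range(arrival, deadline)
    if len(window) < length:
        raise ValueError("trace too short to finish the job before its deadline")
    cheapest = sorted(window, key=lambda h: intensity[h])[:length]
    return sorted(cheapest)


def schedule_emissions(intensity: List[float], hours: List[int],
                       power_kw: float = 1.0) -> float:
    """Emissions of a schedule given as the list of hours in which the job runs."""
    return power_kw * sum(intensity[h] for h in hours)


# Hypothetical hourly carbon intensity in gCO2/kWh (illustrative values only).
intensity = [300, 100, 280, 120, 310, 150, 330, 200]

hours = suspend_resume_schedule(intensity, arrival=0, length=3, slack=3)
aware = schedule_emissions(intensity, hours)
baseline = agnostic_emissions(intensity, arrival=0, length=3)
print(hours, aware, baseline)
```

With this toy trace the job runs in hours 1, 3, and 5 and is suspended in between, emitting 370 units against 680 for the run-immediately baseline. A cluster-level scheduler such as the one the abstract describes also has to decide how many resources to provision across many concurrent jobs, which this per-job greedy choice does not capture.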