FedCostAware: Enabling Cost-Aware Federated Learning on the Cloud

๐Ÿ“… 2025-05-27
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Federated learning (FL) deployment in sensitive domains such as healthcare faces dual challenges on cloud platforms: resource-constrained clients (e.g., hospitals lacking GPUs) and prohibitively high cloud infrastructure costs. Method: This paper proposes the first cost-aware FL framework tailored for cloud spot instances. It integrates budget-aware hierarchical client selection, elastic spot instance lifecycle scheduling, fault-tolerant synchronous aggregation, and a real-time instance state prediction model. Contribution/Results: By modeling spot instance volatility as a schedulable resource constraint, the framework jointly optimizes training timeliness and budget adherence. Experiments across multiple benchmark datasets demonstrate an average 63.2% reduction in cloud cost compared to conventional on-demand or spot-only baselines, while achieving a 98.7% training completion rate. The approach significantly enhances both feasibility and scalability of low-budget institutionsโ€™ participation in collaborative federated modeling.

Technology Category

Application Category

๐Ÿ“ Abstract
Federated learning (FL) is a distributed machine learning (ML) approach that allows multiple clients to collaboratively train ML model without exchanging their original training data, offering a solution that is particularly valuable in sensitive domains such as biomedicine. However, training robust FL models often requires substantial computing resources from participating clients, such as GPUs, which may not be readily available at institutions such as hospitals. While cloud platforms (e.g., AWS) offer on-demand access to such resources, their usage can incur significant costs, particularly in distributed training scenarios where poor coordination strategies can lead to substantial resource wastage. To address this, we introduce FedCostAware, a cost-aware scheduling algorithm designed to optimize synchronous FL on cloud spot instances. FedCostAware addresses the challenges of training on spot instances and different client budgets by employing intelligent management of the lifecycle of spot instances. This approach minimizes resource idle time and overall expenses. Comprehensive experiments across multiple datasets demonstrate that FedCostAware significantly reduces cloud computing costs compared to conventional spot and on-demand schemes, enhancing the accessibility and affordability of FL.
Problem

Research questions and friction points this paper is trying to address.

Optimizing federated learning costs on cloud platforms
Managing spot instances and client budget constraints
Reducing resource idle time and cloud expenses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cost-aware scheduling algorithm for FL
Optimizes synchronous FL on spot instances
Minimizes resource idle time and expenses
๐Ÿ”Ž Similar Papers
No similar papers found.
A
Aditya Sinha
Data Science and Learning Division, Argonne National Laboratory; The Grainger College of Engineering, University of Illinois at Urbana-Champaign; Center for AI Innovation, National Center for Supercomputing Applications
Zilinghan Li
Zilinghan Li
Machine Learning Engineer, Argonne National Laboratory
Federated learningDistributed computingHigh Performance ComputingBiomedical Informatics
T
Tingkai Liu
The Grainger College of Engineering, University of Illinois at Urbana-Champaign; Center for AI Innovation, National Center for Supercomputing Applications
V
Volodymyr V. Kindratenko
The Grainger College of Engineering, University of Illinois at Urbana-Champaign; Center for AI Innovation, National Center for Supercomputing Applications
Kibaek Kim
Kibaek Kim
Argonne National Laboratory
OptimizationDistributed LearningPower Systems
Ravi Madduri
Ravi Madduri
Senior Scientist, University of Chicago, Argonne National Lab
Distributed computingBiomedical InformaticsData intensive scienceScientific WorkflowsBig Data