Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

📅 2026-04-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of conventional federated learning and Probabilistic Synchronous Parallel (PSP) approaches under correlated device failures, which often overlook certain nodes and consequently induce model bias and reduced fairness. To mitigate this, the authors propose Availability-Weighted PSP (AW-PSP), the first framework that jointly models device failure correlations and non-IID data distributions. AW-PSP introduces a Markov-based availability predictor to distinguish between transient and persistent failures and dynamically adjusts node sampling probabilities by integrating historical participation behavior with correlation-aware metrics. Decentralized metadata management is achieved via a distributed hash table, enhancing both sampling fairness and system robustness. Experimental results demonstrate that AW-PSP consistently outperforms standard PSP, improving label coverage and reducing fairness variance under both independent and correlated failure conditions, making it well-suited for large-scale, heterogeneous, and failure-prone federated learning environments.
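
The Markov-based availability predictor mentioned above can be illustrated with a minimal sketch. The source does not give the paper's actual model, so everything here is an assumption made for illustration: a two-state (up/down) chain per device, Laplace-smoothed transition estimates, and a hypothetical `chronic_threshold` on expected downtime that separates transient from persistent failures.

```python
from collections import Counter


def fit_two_state_markov(trace, smoothing=1.0):
    """Estimate P(next state | current state) from a 0/1 availability trace.

    trace: sequence of 0 (down) / 1 (up) observations, one per round.
    smoothing: Laplace pseudo-count so unseen transitions keep non-zero mass.
    Returns probs[current][next].
    """
    counts = Counter(zip(trace, trace[1:]))
    probs = {}
    for cur in (0, 1):
        total = counts[(cur, 0)] + counts[(cur, 1)] + 2 * smoothing
        probs[cur] = {
            nxt: (counts[(cur, nxt)] + smoothing) / total for nxt in (0, 1)
        }
    return probs


def predict_up(probs, current_state):
    """Probability that the device is up in the next round."""
    return probs[current_state][1]


def expected_downtime(probs):
    """Mean number of consecutive down rounds (geometric sojourn time)."""
    return 1.0 / probs[0][1]


def classify_failure(probs, chronic_threshold=5.0):
    """Label a device's failures; the threshold is a hypothetical parameter."""
    if expected_downtime(probs) >= chronic_threshold:
        return "chronic"
    return "transient"
```

A device that drops out for single rounds yields a short expected downtime and is labelled transient, while one that stays down for long stretches is labelled chronic. How AW-PSP itself draws this line is not stated in the summary.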

📝 Abstract
Probabilistic Synchronous Parallel (PSP) is a technique that reduces synchronization bottlenecks in distributed learning systems by sampling a subset of participating nodes each round. In Federated Learning (FL), where edge devices are often unreliable due to factors including mobility, power constraints, and user activity, PSP helps improve system throughput. However, PSP has a key limitation: it assumes that device behavior is static and that different devices are independent of one another. This can make distributed synchronization unfair: highly available nodes dominate training, while nodes that are often unavailable rarely participate, so their data may be missed. If data distribution and node availability are both correlated with the device, PSP and standard FL algorithms alike suffer persistent under-representation of certain classes or groups, which results in inefficient or ineffective learning of certain features. We introduce Availability-Weighted PSP (AW-PSP), an extension to PSP that addresses this correlation between unfair sampling and data availability by dynamically adjusting node sampling probabilities using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based availability predictor distinguishes transient from chronic failures, while a Distributed Hash Table (DHT) layer decentralizes metadata, including latency, freshness, and utility scores. We implement AW-PSP, and a trace-driven evaluation shows that it improves robustness to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP. AW-PSP thus provides an availability-aware and fairness-conscious node sampling protocol for FL deployments that will scale to large numbers of nodes even in heterogeneous and failure-prone environments.
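
To make the idea of dynamically adjusted sampling probabilities concrete, here is a rough sketch of one way availability predictions and participation history could be combined. The weighting formula is not taken from the paper: `sampling_weights`, the exponent `alpha`, and the inverse-participation boost are hypothetical choices, and failure correlation metrics are left out for brevity. The subset draw uses Efraimidis-Spirakis weighted sampling without replacement, swapped in here as a convenient standard technique.

```python
import random


def sampling_weights(p_up, participation, alpha=1.0):
    """Combine predicted availability with a fairness boost (hypothetical rule).

    p_up: node -> predicted probability of being available this round.
    participation: node -> number of past rounds the node contributed to.
    alpha: strength of the boost for rarely participating nodes.
    Returns normalized sampling probabilities.
    """
    weights = {}
    for node, p in p_up.items():
        fairness_boost = (1.0 / (1.0 + participation.get(node, 0))) ** alpha
        weights[node] = p * fairness_boost
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    return {node: w / total for node, w in weights.items()}


def sample_nodes(weights, k, rng):
    """Draw up to k distinct nodes, favouring higher weights.

    Efraimidis-Spirakis: give each node the key u ** (1 / w) with u uniform
    in [0, 1), then keep the k largest keys.
    """
    keys = {
        node: rng.random() ** (1.0 / w) for node, w in weights.items() if w > 0
    }
    return sorted(keys, key=keys.get, reverse=True)[:k]
```

Under this assumed rule, a reliable node that has already participated many times is sampled less often than an equally reliable newcomer, which is the kind of rebalancing the abstract describes. A call such as `sample_nodes(sampling_weights(p_up, participation), k=2, rng=random.Random(0))` selects one round's subset.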

Problem

Research questions and friction points this paper is trying to address.

Federated Learning
device failure correlation
unfair sampling
data heterogeneity
synchronization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Availability-Weighted PSP
correlated device failure
fairness-aware sampling
Markov-based availability prediction
distributed hash table