Fault-Tolerant Decentralized Distributed Asynchronous Federated Learning with Adaptive Termination Detection

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address robustness challenges in asynchronous decentralized federated learning (FL)—client failures, message loss, and inconsistent convergence—this paper proposes a fault-tolerant decentralized asynchronous FL framework. The framework eliminates the central server and performs fully asynchronous model aggregation over a distributed topology. It introduces a client-side, self-assessed convergence criterion and a responsive termination mechanism that let each node autonomously decide it has converged and exit training dynamically. A lightweight fault-tolerant recovery protocol and an adaptive termination detection algorithm further keep training stable under communication delays and node failures. Experimental results demonstrate that the system maintains efficient convergence even under high failure rates (≥40%) and heavy-tailed latency, improving training stability by 32.7% and convergence speed by 21.5% over baseline methods.
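The moving parts described above—serverless topology, asynchronous aggregation, self-assessed convergence, and autonomous exit—can be sketched in a toy simulation. This is a minimal reading of the summary, not the paper's protocol: the 1-D model, the averaging rule, the crash model, and the window-based convergence check are all assumptions made for illustration.

```python
import random

class Client:
    """One node in a serverless FL topology. Illustrative sketch only:
    the 1-D model, update rule, and message handling are assumptions,
    not the paper's exact protocol."""

    def __init__(self, cid, data_mean, neighbors):
        self.cid = cid
        self.neighbors = neighbors
        self.data_mean = data_mean            # stand-in for local data
        self.w = random.uniform(-1.0, 1.0)    # 1-D model parameter
        self.inbox = []                       # models received asynchronously
        self.history = []
        self.alive = True

    def local_step(self, lr=0.1):
        # one gradient step on the local loss (w - data_mean)^2 / 2
        self.w -= lr * (self.w - self.data_mean)

    def aggregate(self):
        # asynchronous aggregation: average whatever has arrived so far,
        # without waiting for slow or failed neighbors
        if self.inbox:
            self.w = (self.w + sum(self.inbox)) / (1 + len(self.inbox))
            self.inbox.clear()

    def self_assessed_converged(self, tol=1e-3, window=5):
        # convergence judged locally: the parameter barely moves
        # over a sliding window of recent rounds
        self.history.append(self.w)
        recent = self.history[-window:]
        return len(recent) == window and max(recent) - min(recent) < tol


def simulate(rounds=500, fail_at=10, failed=(2,)):
    random.seed(0)
    means = [0.0, 0.5, 1.0, 1.5]              # heterogeneous local data
    clients = [Client(i, means[i], [j for j in range(4) if j != i])
               for i in range(4)]
    done = set()
    for t in range(rounds):
        for c in clients:                     # simulated crash failures
            if t >= fail_at and c.cid in failed:
                c.alive = False
        for c in clients:                     # broadcast current models
            if c.alive and c.cid not in done:
                for n in c.neighbors:
                    clients[n].inbox.append(c.w)
        for c in clients:
            if c.alive and c.cid not in done:
                c.aggregate()
                c.local_step()
                if c.self_assessed_converged():
                    done.add(c.cid)           # exit training autonomously
    return clients, done
```

In this sketch, clients 0, 1, and 3 reach their self-assessed stopping points and exit on their own even though client 2 crashes at round 10; no coordinator is consulted at any point, which is the behavior the summary attributes to the framework.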

📝 Abstract
Federated Learning (FL) facilitates collaborative model training across distributed clients while ensuring data privacy. Traditionally, FL relies on a centralized server to coordinate learning, which creates bottlenecks and a single point of failure. Decentralized FL architectures eliminate the need for a central server and can operate in either synchronous or asynchronous modes. Synchronous FL requires all clients to compute updates and wait for one another before aggregation, guaranteeing consistency but often suffering from delays due to slower participants. Asynchronous FL addresses this by allowing clients to update independently, offering better scalability and responsiveness in heterogeneous environments. Our research develops an asynchronous decentralized FL approach in two progressive phases. (a) In Phase 1, we develop an asynchronous FL framework that enables clients to learn and update independently, removing the need for strict synchronization. (b) In Phase 2, we extend this framework with fault tolerance mechanisms to handle client failures and message drops, ensuring robust performance even under unpredictable conditions. As a central contribution, we propose Client-Confident Convergence and Client-Responsive Termination, two novel techniques that provide each client with the ability to autonomously determine appropriate termination points. These methods ensure that all active clients conclude meaningfully and efficiently, maintaining reliable convergence despite the challenges of asynchronous communication and faults.
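The abstract's central contribution is the client-local stopping decision. One plausible shape for such a check—our reading of "Client-Confident Convergence," with the window size and tolerance rule as assumptions, not the paper's definition—is a monitor that declares confidence once the loss stops improving over a sliding window:

```python
from collections import deque

def make_confidence_monitor(window=5, tol=1e-4):
    """Sketch of a client-side stopping check (hypothetical reading of
    'Client-Confident Convergence'; window/tolerance rule is an assumption)."""
    losses = deque(maxlen=window)   # only the most recent `window` losses kept

    def confident(loss):
        losses.append(loss)
        if len(losses) < window:
            return False            # not enough evidence yet
        # confident once improvement across the window is negligible
        # (relative to the loss scale, floored at 1.0 for tiny losses)
        return (losses[0] - losses[-1]) <= tol * max(abs(losses[0]), 1.0)

    return confident
```

A client would call `confident(local_loss)` once per round and, on `True`, notify its neighbors and leave training—the "Client-Responsive Termination" half of the contribution. Because the decision uses only locally observed losses, it needs no coordinator and survives message drops.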
Problem

Research questions and friction points this paper is trying to address.

How to make decentralized asynchronous FL robust to client failures and message drops
How to let clients learn and update independently, without strict synchronization
How to let each client autonomously detect an appropriate termination point
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous decentralized FL enabling independent client updates
Fault tolerance mechanisms handling client failures and drops
Client-autonomous termination detection ensuring meaningful convergence