🤖 AI Summary
To address model staleness and high communication overhead in Asynchronous Decentralized Federated Learning (ADFL) under heterogeneous dynamic edge environments, this paper proposes a novel joint optimization framework integrating dynamic staleness control with phase-aware topology construction. We design a worker activation strategy and an adaptive topology construction algorithm to mitigate gradient delay and reduce redundant transmissions while preserving convergence guarantees. Theoretical analysis establishes convergence under non-IID data distributions. Experiments demonstrate that, compared to state-of-the-art methods, our approach reduces training completion time by 51.8% and communication overhead by 57.1%, without sacrificing model accuracy. The core innovation lies in the co-modeling of dynamic staleness regulation and topology evolution, achieving Pareto improvements in training efficiency, communication cost, and model performance.
📝 Abstract
Federated Learning (FL) has emerged as a promising distributed learning paradigm that enables model training on edge devices (i.e., workers) while preserving data privacy. However, its reliance on a centralized server leads to limited scalability. Decentralized federated learning (DFL) eliminates the dependency on a centralized server by enabling peer-to-peer model exchange. Existing DFL mechanisms mainly employ synchronous communication, which may result in training inefficiencies under heterogeneous and dynamic edge environments. Although a few recent asynchronous DFL (ADFL) mechanisms have been proposed to address these issues, they typically yield stale model aggregation and frequent model transmission, leading to degraded training performance on non-IID data and high communication overhead. To overcome these issues, we present DySTop, an innovative mechanism that jointly optimizes dynamic staleness control and topology construction in ADFL. In each round, multiple workers are activated, and a subset of their neighbors is selected to transmit models for aggregation, followed by local training. We provide a rigorous convergence analysis for DySTop, theoretically revealing the quantitative relationships between the convergence bound and key factors such as maximum staleness, activation frequency, and data distribution among workers. From the insights of this analysis, we propose a worker activation algorithm (WAA) for staleness control and a phase-aware topology construction algorithm (PTCA) to reduce communication overhead and handle non-IID data. Extensive evaluations through both large-scale simulations and real-world testbed experiments demonstrate that DySTop reduces completion time by 51.8% and communication resource consumption by 57.1% compared to state-of-the-art solutions, while maintaining the same model accuracy.
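The per-round procedure described in the abstract (activate workers, select a staleness-bounded subset of neighbors, aggregate their models, then train locally) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual algorithm: all names (`Worker`, `select_neighbors`, `run_round`), the scalar stand-in for a model, and the staleness-weighted averaging rule are assumptions introduced here for clarity.

```python
# Hypothetical sketch of one DySTop-style asynchronous round.
# A scalar "model" stands in for a full parameter vector, and the
# staleness-discounted averaging rule is an illustrative assumption.
from dataclasses import dataclass, field


@dataclass
class Worker:
    wid: int
    model: float
    version: int = 0                      # local round counter, used to measure staleness
    neighbors: list = field(default_factory=list)


def select_neighbors(worker, max_staleness):
    # Staleness control: only pull from neighbors whose models are not
    # more than max_staleness rounds behind this worker.
    return [n for n in worker.neighbors
            if worker.version - n.version <= max_staleness]


def aggregate(worker, peers):
    # Staleness-aware weighted average: fresher peer models weigh more.
    weighted = [(worker.model, 1.0)]
    for p in peers:
        staleness = max(worker.version - p.version, 0)
        weighted.append((p.model, 1.0 / (1.0 + staleness)))
    total = sum(w for _, w in weighted)
    return sum(m * w for m, w in weighted) / total


def local_step(model, gradient, lr=0.1):
    # One gradient-descent update standing in for local training.
    return model - lr * gradient


def run_round(activated, max_staleness=2):
    # Each activated worker aggregates selected neighbors, trains, and
    # advances its version; non-activated workers are untouched.
    for w in activated:
        peers = select_neighbors(w, max_staleness)
        w.model = aggregate(w, peers)
        w.model = local_step(w.model, gradient=w.model)  # toy quadratic loss
        w.version += 1
```

In this toy setting, activating only a subset of workers per round and bounding neighbor staleness mirrors the two levers the paper's WAA and PTCA components control: who trains, and which links carry model traffic.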