Self-Regulating Random Walks for Resilient Decentralized Learning on Graphs

📅 2024-07-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In decentralized learning on graphs with multiple concurrent random walks (RWs), node and link failures drive the number of RWs toward zero and eventually collapse the system, yet existing solutions rely on global monitoring, which contradicts the goal of decentralization. Method: This paper proposes a fully decentralized, self-regulating mechanism that requires no global coordination. Its core innovations are (i) the first local estimation of the RW count from the return-time distribution and (ii) DecAFork and DecAFork+, fully distributed algorithms that dynamically fork RWs (and, in DecAFork+, also terminate them) to hold the RW count near a target value. Contribution/Results: The authors prove that both algorithms stabilize the RW count within ±1 of the target with high probability. Simulations demonstrate >3.2× faster failure response than baselines while preserving network load balance. According to the authors, this is the first work to achieve fault-robust, autonomous regulation of the RW count in a purely decentralized setting, which improves the resilience of distributed learning systems.
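As a rough illustration of how a node could estimate the number of walks from purely local observations, the sketch below uses a simplification rather than the paper's return-time estimator. For a simple random walk in stationarity, the expected return time to a node v is 2|E|/deg(v), so K independent walks visit v at a rate of K·deg(v)/(2|E|) per step, and inverting the observed visit rate gives an estimate of K. The graph `ADJ`, the function name `estimate_walk_count`, and the assumption that a node knows the total edge count |E| are hypothetical choices made only for this example.

```python
import random

# Hypothetical 5-node graph (a 5-cycle plus the chord 0-2), as adjacency lists.
ADJ = {0: [1, 4, 2], 1: [0, 2], 2: [1, 3, 0], 3: [2, 4], 4: [3, 0]}


def estimate_walk_count(adj, node, num_walks, steps, seed=0):
    """Simulate independent simple random walks and estimate how many
    there are from the visit rate observed at a single node.

    Assumption (not from the paper): the node knows 2|E| and its own degree.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2|E|

    # Start every walk from the stationary distribution (proportional to degree).
    weights = [len(adj[v]) for v in nodes]
    positions = rng.choices(nodes, weights=weights, k=num_walks)

    visits = 0
    for _ in range(steps):
        # Each walk moves to a uniformly random neighbor.
        positions = [rng.choice(adj[p]) for p in positions]
        visits += sum(1 for p in positions if p == node)

    # Each walk visits `node` with probability deg(node) / 2|E| per step,
    # so the visit rate divided by that probability estimates the walk count.
    return visits / steps * two_m / len(adj[node])


if __name__ == "__main__":
    print(estimate_walk_count(ADJ, node=0, num_walks=5, steps=20000))
```

With a long enough observation window the estimate should land close to the true count of 5. A shorter window reacts faster to failures but gives a noisier estimate, which is the trade-off any local estimator of this kind has to manage.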

📝 Abstract

Consider the setting of multiple random walks (RWs) on a graph executing a certain computational task. For instance, in decentralized learning via RWs, a model is updated at each iteration based on the local data of the visited node and then passed to a randomly chosen neighbor. RWs can fail due to node or link failures. The goal is to maintain a desired number of RWs to ensure failure resilience. Achieving this is challenging because there is no central entity to track which RWs have failed and to replace them by forking (duplicating) surviving ones. Without duplications, the number of RWs will eventually go to zero, causing a catastrophic failure of the system. We propose two decentralized algorithms, DecAFork and DecAFork+, that can maintain the number of RWs in the graph around a desired value even in the presence of arbitrary RW failures. Nodes continuously estimate the number of surviving RWs by estimating their return time distribution and fork the RWs when failures are likely to happen. DecAFork+ additionally allows terminations, so that the network is not overloaded by too many forked RWs. We present extensive numerical simulations that evaluate how quickly DecAFork and DecAFork+ detect and react to failures compared to a baseline, and we establish theoretical guarantees on the performance of both algorithms.
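The fork-and-terminate idea can be sketched with a toy simulation. This is not the DecAFork or DecAFork+ algorithm from the paper: the rule below is a simplified stand-in in which each node estimates the walk count from its recent visit rate and, with a small probability, forks an arriving walk when the estimate is low or terminates it when the estimate is high. The complete graph, the fixed failure schedule (one walk removed every `fail_every` steps), the thresholds `target - 1` and `target + 1`, and all parameter names are assumptions chosen for illustration.

```python
import random
from collections import deque


def simulate(regulate, target=5, n=10, steps=5000, window=100,
             fail_every=500, p_act=0.01, seed=1):
    """Return the number of live walks after each step.

    regulate=False: walks only move and fail (no forking).
    regulate=True:  nodes fork or terminate arriving walks using a local estimate.
    """
    rng = random.Random(seed)
    # Hypothetical topology: complete graph on n nodes.
    adj = {v: [u for u in range(n) if u != v] for v in range(n)}
    two_m = sum(len(nb) for nb in adj.values())  # equals 2|E|
    walks = [rng.randrange(n) for _ in range(target)]  # current node of each walk
    visits = {v: deque() for v in adj}  # recent visit times seen by each node
    history = []

    for t in range(1, steps + 1):
        # Injected failure: one randomly chosen walk disappears.
        if t % fail_every == 0 and walks:
            walks.pop(rng.randrange(len(walks)))

        next_walks = []
        for pos in walks:
            v = rng.choice(adj[pos])  # move to a random neighbor
            log = visits[v]
            log.append(t)
            while log[0] <= t - window:  # keep only the last `window` steps
                log.popleft()

            keep, fork = True, False
            if regulate and t >= window:  # wait until the window is filled
                # Local estimate of the walk count from the visit rate at v.
                est = len(log) / window * two_m / len(adj[v])
                if est < target - 1 and rng.random() < p_act:
                    fork = True   # too few walks seen: duplicate this one
                elif est > target + 1 and rng.random() < p_act:
                    keep = False  # too many walks seen: drop this one
            if keep:
                next_walks.append(v)
            if fork:
                next_walks.append(v)
        walks = next_walks
        history.append(len(walks))
    return history


if __name__ == "__main__":
    print("final count without forking:", simulate(False)[-1])
    print("final count with regulation:", simulate(True)[-1])
```

Without regulation, the five initial walks are all gone after the fifth injected failure, which matches the abstract's observation that the count eventually reaches zero. With the local rule switched on, lost walks are replaced by forks and the termination branch limits overshoot; the small action probability `p_act` is what keeps many nodes from reacting to the same failure at once.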

Problem

Research questions and friction points this paper is trying to address.

Maintain a desired number of random walks on a graph

Ensure resilience against node and link failures

Decentralize tracking and replacement of failed walks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized learning via RWs

Self-regulating RW duplication

Failure detection and reaction

🔎 Similar Papers

Robustness of Decentralised Learning to Nodes and Data Disruption