🤖 AI Summary
In large-scale decentralized learning, decentralized SGD suffers severe convergence-rate degradation as the number of nodes grows, primarily because the communication topology is sparse.
Method: This paper proposes TELEPORTATION, a framework that activates only a small subset of nodes in each round. The newly active nodes fetch parameters from the previously active nodes, update them by SGD, and perform gossip averaging over a small topology comprising only the active nodes. The paper also introduces an efficient hyperparameter-tuning method for finding the appropriate number of nodes to activate.
Contribution/Results: Theoretically, by activating a properly chosen number of nodes, TELEPORTATION completely alleviates the convergence-rate degradation caused by increasing the number of nodes. Empirically, it trains neural networks more stably and reaches higher final accuracy than decentralized SGD across multiple tasks.
📝 Abstract
Decentralized SGD can run with low communication costs, but its sparse communication deteriorates the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur over only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from the previously active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we show that TELEPORTATION can train neural networks more stably and achieve higher accuracy than decentralized SGD.
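To make the round structure concrete, here is a minimal NumPy sketch of one TELEPORTATION-style round, under stated assumptions: the activation rule, the fetch rule, and the ring mixing matrix are illustrative choices of ours (the paper does not pin these down in the abstract), and `teleportation_round`, `ring_mixing_matrix`, and `grad_fn` are hypothetical names, not the authors' implementation.

```python
import numpy as np


def ring_mixing_matrix(k):
    """Doubly stochastic mixing matrix for a ring of k >= 3 active nodes
    (an illustrative small topology; the paper's choice may differ)."""
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] = 1 / 3
        W[i, (i - 1) % k] = 1 / 3
        W[i, (i + 1) % k] = 1 / 3
    return W


def teleportation_round(params, prev_active, k, lr, grad_fn, rng):
    """One illustrative TELEPORTATION-style round.

    params:      (n, d) array of per-node parameters
    prev_active: indices of nodes that were active in the previous round
    k:           number of nodes to activate this round
    grad_fn:     stochastic gradient oracle, grad_fn(x, node_index) -> (d,)
    """
    n, _ = params.shape
    # 1. Activate a subset of k nodes (here: uniformly at random).
    active = rng.choice(n, size=k, replace=False)
    # 2. Newly active nodes fetch parameters from previously active nodes.
    params[active] = params[rng.choice(prev_active, size=k)]
    # 3. Each active node takes a local SGD step.
    for i in active:
        params[i] -= lr * grad_fn(params[i], i)
    # 4. Gossip averaging over the small active-node topology.
    W = ring_mixing_matrix(k)
    params[active] = W @ params[active]
    return params, active
```

Inactive nodes do no computation or communication in a round, which is what keeps the effective topology small and its spectral gap large regardless of the total number of nodes n.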