🤖 AI Summary
In large-scale decentralized learning, decentralized SGD suffers severe convergence-rate degradation as the number of nodes grows, primarily because the communication topology is sparse.
Method: This paper proposes TELEPORTATION, a framework that activates only a small subset of nodes in each round. The newly active nodes fetch parameters from the previously active nodes, update them by SGD, and perform gossip averaging over a small topology comprising only the active nodes. The paper also introduces an efficient hyperparameter-tuning method for finding the appropriate number of nodes to activate.
Contribution/Results: Theoretically, by activating a properly chosen number of nodes, TELEPORTATION completely alleviates the convergence-rate degradation caused by increasing the number of nodes. Empirically, it trains neural networks more stably and reaches higher final accuracy than decentralized SGD across multiple tasks.
📝 Abstract
Decentralized SGD can run with low communication costs, but its sparse communication deteriorates the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur over only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from the previously active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we show that TELEPORTATION can train neural networks more stably and achieve higher accuracy than decentralized SGD.
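To make the round structure concrete, here is a minimal NumPy sketch of one TELEPORTATION-style round, under stated assumptions: the activation rule, the fetch rule, and the ring mixing matrix are illustrative choices of ours (the paper does not pin these down in the abstract), and `teleportation_round`, `ring_mixing_matrix`, and `grad_fn` are hypothetical names, not the authors' implementation.

```python
import numpy as np


def ring_mixing_matrix(k):
    """Doubly stochastic mixing matrix for a ring of k >= 3 active nodes
    (an illustrative small topology; the paper's choice may differ)."""
    W = np.zeros((k, k))
    for i in range(k):
        W[i, i] = 1 / 3
        W[i, (i - 1) % k] = 1 / 3
        W[i, (i + 1) % k] = 1 / 3
    return W


def teleportation_round(params, prev_active, k, lr, grad_fn, rng):
    """One illustrative TELEPORTATION-style round.

    params:      (n, d) array of per-node parameters
    prev_active: indices of nodes that were active in the previous round
    k:           number of nodes to activate this round
    grad_fn:     stochastic gradient oracle, grad_fn(x, node_index) -> (d,)
    """
    n, _ = params.shape
    # 1. Activate a subset of k nodes (here: uniformly at random).
    active = rng.choice(n, size=k, replace=False)
    # 2. Newly active nodes fetch parameters from previously active nodes.
    params[active] = params[rng.choice(prev_active, size=k)]
    # 3. Each active node takes a local SGD step.
    for i in active:
        params[i] -= lr * grad_fn(params[i], i)
    # 4. Gossip averaging over the small active-node topology.
    W = ring_mixing_matrix(k)
    params[active] = W @ params[active]
    return params, active
```

Inactive nodes do no computation or communication in a round, which is what keeps the effective topology small and its spectral gap large regardless of the total number of nodes n.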