🤖 AI Summary
Decentralized learning is often hindered by high peer-to-peer communication overhead, and frequent synchronization is conventionally deemed necessary for competitive generalization. This work proposes a minimalist communication scheduling strategy: performing only a single global model aggregation at the very end of training, while running all intermediate iterations in a fully decentralized manner. Theoretically, we show that moderate local model divergence accelerates convergence rather than impeding optimization. Under drastically reduced total communication volume, our method achieves generalization performance on par with centralized parameter-server training and enjoys a provably faster convergence rate than standard mini-batch SGD. Extensive experiments demonstrate robustness across heterogeneous data distributions and diverse network topologies. By challenging the entrenched assumption that decentralization necessitates frequent synchronization, this work establishes a new paradigm for high-performance distributed learning with minimal communication cost.
📝 Abstract
Decentralized learning provides a scalable alternative to traditional parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Our empirical results show that concentrating communication budgets in the later stages of decentralized training markedly improves global generalization. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, is sufficient to match the performance of server-based training. We further show that low communication in decentralized learning preserves the *mergeability* of local models throughout training. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can converge faster than centralized mini-batch SGD. Technically, we reinterpret part of the discrepancy among local models, previously regarded as detrimental noise, as a constructive component that accelerates convergence. This work challenges the common belief that decentralized learning generalizes poorly under data heterogeneity and limited communication, while offering new insights into model merging and neural network loss landscapes.
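The "single global merging" schedule can be illustrated with a toy sketch (my own simplification, not the authors' code): each worker minimizes a heterogeneous local quadratic loss with no communication at all, and the models are averaged exactly once at the end. In this convex toy case the merged model lands on the global optimum; the paper's contribution is showing an analogous effect for nonconvex neural networks.

```python
# Toy sketch of a "single final merge" schedule (illustrative assumption,
# not the paper's implementation): K workers run local gradient descent on
# heterogeneous quadratic losses f_k(x) = (x - c_k)^2 with NO communication,
# then perform one global averaging step at the very end of training.

def run_single_final_merge(centers, steps=200, lr=0.1):
    models = [0.0] * len(centers)          # all workers share the same init
    for _ in range(steps):
        for k, c in enumerate(centers):
            grad = 2.0 * (models[k] - c)   # gradient of the local loss (x - c_k)^2
            models[k] -= lr * grad         # purely local update, no synchronization
    return sum(models) / len(models)       # single global merge (averaging) at the end

centers = [-3.0, 1.0, 5.0]                 # heterogeneous local optima (non-IID data)
merged = run_single_final_merge(centers)
global_opt = sum(centers) / len(centers)   # minimizer of the average of all local losses
print(abs(merged - global_opt) < 1e-6)     # merged model matches the global optimum
```

Here each worker drifts toward its own local optimum `c_k`, yet the one-shot average still recovers the minimizer of the global objective; for quadratics this holds exactly, which is why the example is only a sketch of the intuition rather than evidence for the nonconvex deep-learning setting the paper studies.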