Topology-aware Generalization of Decentralized SGD

📅 2022-06-25
🏛️ International Conference on Machine Learning
📈 Citations: 36
Influential: 3

🤖 AI Summary
This work studies the generalization of decentralized stochastic gradient descent (D-SGD) in the non-convex, non-smooth setting. It presents what the authors describe as the first topology-aware algorithmic stability analysis of vanilla D-SGD, combining spectral graph theory with stochastic optimization. The analysis explicitly quantifies how the generalization error depends on the communication topology's spectral gap 1 - λ, the number of workers m, and the total sample size N, and shows that the bound tightens as the spectral gap widens (that is, as λ decreases). Unlike prior analyses of the projected version of D-SGD, whose bounds become vacuous as λ approaches 1, the resulting bound stays non-vacuous, and it explains why consensus control in the initial training phase improves generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet support the theory. The results give a theoretical basis for choosing the communication topology in decentralized learning.
📝 Abstract
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(N^{-1}+m^{-1}+\lambda^2)$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}(N^{-(1+\alpha)/2}+m^{-(1+\alpha)/2}+\lambda^{1+\alpha}+\phi_{\mathcal{S}})$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
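As a rough illustration of the quantity the abstract calls the spectral gap, the sketch below computes $\lambda$ and $1-\lambda$ for two communication topologies. It is not taken from the paper's code: the function names, the uniform 1/3 ring weights, and the choice to read $\lambda$ as the second-largest eigenvalue magnitude of a symmetric doubly stochastic gossip matrix are all assumptions made for illustration.

```python
import numpy as np


def ring_gossip_matrix(m: int) -> np.ndarray:
    """Symmetric doubly stochastic matrix for a ring (hypothetical weights):
    each worker averages itself and its two neighbours with weight 1/3."""
    if m < 3:
        raise ValueError("a ring needs at least 3 workers")
    P = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            P[i, j % m] = 1.0 / 3.0
    return P


def complete_gossip_matrix(m: int) -> np.ndarray:
    """Fully connected topology: every worker averages all m models."""
    return np.full((m, m), 1.0 / m)


def second_eigenvalue_magnitude(P: np.ndarray) -> float:
    """Assumed definition of lambda: the second-largest |eigenvalue| of P.
    eigvalsh is used because P is assumed to be symmetric."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    return float(magnitudes[1])


def spectral_gap(P: np.ndarray) -> float:
    """Spectral gap 1 - lambda; larger means a better connected topology."""
    return 1.0 - second_eigenvalue_magnitude(P)


if __name__ == "__main__":
    m = 16
    for name, P in [("ring", ring_gossip_matrix(m)),
                    ("complete", complete_gossip_matrix(m))]:
        lam = second_eigenvalue_magnitude(P)
        print(f"{name:>8}: lambda = {lam:.4f}, spectral gap = {1.0 - lam:.4f}")
```

Under these assumptions the ring has $\lambda$ close to 1 (a small gap) while the complete graph has $\lambda$ near 0, which is the regime distinction the $\lambda^2$ and $\lambda^{1+\alpha}$ terms in the bounds are sensitive to.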
Problem

Research questions and friction points this paper is trying to address.

Analyzes the algorithmic stability of decentralized SGD in the non-convex, non-smooth setting
Establishes generalization bounds linked to topology spectral gap
Explains consensus control impact on D-SGD generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Topology-aware stability and generalization analysis of vanilla D-SGD
Non-vacuous generalization bound for D-SGD, even when λ is close to 1
Explains why consensus control early in training improves generalization
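To make the role of topology concrete, here is a minimal sketch of a common formulation of vanilla D-SGD (gossip averaging with neighbours combined with a local stochastic gradient step) on a toy problem. Everything in it is an illustrative assumption rather than the authors' implementation: the quadratic local losses, the additive Gaussian noise standing in for mini-batch sampling, the ring weights, and all function names. The paper's exact update rule may differ.

```python
import numpy as np


def ring_gossip_matrix(m: int) -> np.ndarray:
    """Ring topology with hypothetical uniform weights of 1/3."""
    P = np.zeros((m, m))
    for i in range(m):
        for j in (i - 1, i, i + 1):
            P[i, j % m] = 1.0 / 3.0
    return P


def complete_gossip_matrix(m: int) -> np.ndarray:
    """Fully connected topology: plain averaging over all workers."""
    return np.full((m, m), 1.0 / m)


def dsgd(P, targets, steps=200, lr=0.1, noise_scale=0.1, seed=0):
    """Toy D-SGD: worker i minimises f_i(x) = 0.5 * ||x - targets[i]||^2.

    Each step mixes the models through the gossip matrix P and subtracts a
    noisy local gradient (the noise is a stand-in for mini-batch sampling).
    """
    rng = np.random.default_rng(seed)
    m, d = targets.shape
    X = np.zeros((m, d))  # row i holds worker i's local model
    for _ in range(steps):
        grads = (X - targets) + rng.normal(scale=noise_scale, size=(m, d))
        X = P @ X - lr * grads
    return X


def consensus_distance(X: np.ndarray) -> float:
    """Mean squared distance of the local models from their average."""
    return float(np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1)))


if __name__ == "__main__":
    m = 16
    angles = 2 * np.pi * np.arange(m) / m
    # Local optima placed on a circle so neighbouring workers hold similar data.
    targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    for name, P in [("ring", ring_gossip_matrix(m)),
                    ("complete", complete_gossip_matrix(m))]:
        X = dsgd(P, targets)
        print(f"{name:>8}: consensus distance = {consensus_distance(X):.4f}")
```

In this toy setting the averaged (consensus) model follows the same trajectory for both topologies, because a doubly stochastic gossip matrix preserves the average; what changes is how far the local models drift from it. The sparser ring leaves a much larger consensus distance than the complete graph. The sketch says nothing about test error on real data; it only shows the mechanism through which the spectral gap enters.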