🤖 AI Summary
This work addresses the lack of high-probability generalization guarantees for decentralized stochastic gradient descent (D-SGD) comparable to those of classical SGD. Using pointwise uniform stability, a weaker notion than uniform stability, the paper develops high-probability generalization bounds for D-SGD that target the optimal rate of $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/δ)\right)$ across convex, strongly convex, and non-convex settings. The analysis also covers gradient-based measures for non-convex cases where only local minima exist, optimization error and excess risk bounds, and generalization bounds for local models under time-varying communication. Together, these results indicate that the high-probability guarantees known for SGD carry over to D-SGD, including under communication constraints and dynamic network topologies.
📝 Abstract
Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly provide bounds in expectation, whose rates are limited to $\mathcal{O}\left(\frac{1}{\delta\sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right)$. This discrepancy reveals a gap between the high-probability guarantees for SGD and those for D-SGD. To close this gap, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning (a weaker notion than uniform stability) and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
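To make the gap between the two rates concrete, the following minimal sketch evaluates $\frac{1}{\delta\sqrt{mn}}$ and $\frac{1}{\sqrt{mn}}\log(1/\delta)$ numerically. It is an illustration only: the hidden constants in the $\mathcal{O}(\cdot)$ terms are set to 1, and the function names and the values of $m$, $n$, and $\delta$ are assumptions chosen for the example, not taken from the paper.

```python
import math


def polynomial_rate(m: int, n: int, delta: float) -> float:
    """Rate of order 1 / (delta * sqrt(m * n)); constant assumed to be 1."""
    return 1.0 / (delta * math.sqrt(m * n))


def logarithmic_rate(m: int, n: int, delta: float) -> float:
    """Rate of order log(1 / delta) / sqrt(m * n); constant assumed to be 1."""
    return math.log(1.0 / delta) / math.sqrt(m * n)


if __name__ == "__main__":
    # Hypothetical setting: 16 workers, 10,000 samples per worker.
    m, n = 16, 10_000
    for delta in (0.1, 0.01, 0.001):
        poly = polynomial_rate(m, n, delta)
        log_ = logarithmic_rate(m, n, delta)
        print(f"delta={delta}: 1/(delta*sqrt(mn))={poly:.5f}, "
              f"log(1/delta)/sqrt(mn)={log_:.5f}")
```

As $\delta$ shrinks, the first quantity grows like $1/\delta$ while the second grows only like $\log(1/\delta)$, which is why the logarithmic dependence matters at high confidence levels.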