Unveiling High-Probability Generalization in Decentralized SGD

📅 2026-05-11
🤖 AI Summary
This work addresses the lack of high-probability generalization guarantees for decentralized stochastic gradient descent (D-SGD) comparable to those of classical SGD. Using pointwise uniform stability, a weaker notion than uniform stability, the paper establishes the first near-optimal high-probability generalization bounds for D-SGD, with a rate of $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$, in convex, strongly convex, and non-convex settings. For non-convex problems where only local minima exist, it gives high-probability results for gradient-based measures, and it also derives optimization error and excess risk bounds. The analysis extends to local models in time-varying communication networks, which accounts for communication overhead. Together, these results indicate that D-SGD retains strong generalization guarantees under communication constraints and changing network topologies.
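To make the role of the confidence parameter concrete, the following minimal sketch compares the two dependences on δ quoted on this page: the 1/δ factor of earlier expectation-based results and the log(1/δ) factor of the high-probability rate. The function names are hypothetical, and the constants hidden by the 𝒪(·) notation are set to 1 purely for illustration, so the printed numbers show scaling only, not actual bound values.

```python
import math

def expectation_style_rate(m: int, n: int, delta: float) -> float:
    """Scaling 1 / (delta * sqrt(m * n)); hidden constants assumed to be 1."""
    return 1.0 / (delta * math.sqrt(m * n))

def high_probability_rate(m: int, n: int, delta: float) -> float:
    """Scaling log(1 / delta) / sqrt(m * n); hidden constants assumed to be 1."""
    return math.log(1.0 / delta) / math.sqrt(m * n)

if __name__ == "__main__":
    m, n = 4, 2500  # illustrative worker count and sample size (assumed values)
    for delta in (0.1, 0.01, 0.001):
        slow = expectation_style_rate(m, n, delta)
        fast = high_probability_rate(m, n, delta)
        print(f"delta={delta}: 1/delta scaling={slow:.4f}, log(1/delta) scaling={fast:.4f}")
```

As δ shrinks, the 1/δ term grows linearly in 1/δ while the logarithmic term grows far more slowly, which is why the logarithmic dependence is the one worth targeting.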
📝 Abstract
Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{\delta\sqrt{mn}}\right)$, where $\delta$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/\delta)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this gap, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/\delta)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning (a weaker notion than uniform stability) and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
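The abstract does not spell out the D-SGD update, so the sketch below assumes one common textbook form: each worker averages its neighbors' parameters using a mixing matrix and then takes a local stochastic gradient step. The names `dsgd_step`, `sgd_step`, and `mixing` are hypothetical, and the paper's exact variant may differ. The example checks the statement that D-SGD reduces to traditional SGD when $m=1$.

```python
from typing import List

Vector = List[float]

def dsgd_step(params: List[Vector], mixing: List[List[float]],
              grads: List[Vector], lr: float) -> List[Vector]:
    """One assumed D-SGD round for m workers holding d-dimensional parameters.

    params[i] and grads[i] belong to worker i; mixing[i][j] is the weight
    worker i places on worker j's parameters (rows assumed to sum to 1).
    """
    m, d = len(params), len(params[0])
    updated = []
    for i in range(m):
        # Communication step: weighted average of the workers' parameters.
        mixed = [sum(mixing[i][j] * params[j][k] for j in range(m)) for k in range(d)]
        # Local step: move against worker i's own stochastic gradient.
        updated.append([mixed[k] - lr * grads[i][k] for k in range(d)])
    return updated

def sgd_step(params: Vector, grad: Vector, lr: float) -> Vector:
    """Plain SGD update for a single model."""
    return [p - lr * g for p, g in zip(params, grad)]

# With one worker the mixing matrix is [[1.0]], so the averaging step is a no-op.
single_worker = dsgd_step([[1.0, -2.0]], [[1.0]], [[0.5, 0.5]], lr=0.1)
plain_sgd = sgd_step([1.0, -2.0], [0.5, 0.5], lr=0.1)

# Two workers with uniform averaging and zero gradients reach consensus in one round.
consensus = dsgd_step([[0.0], [2.0]], [[0.5, 0.5], [0.5, 0.5]], [[0.0], [0.0]], lr=0.1)
```

A time-varying network, as mentioned at the end of the abstract, would correspond to passing a different `mixing` matrix at each round in this sketch.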
Problem

Research questions and friction points this paper is trying to address.

Decentralized SGD
Generalization bound
High-probability guarantee
Distributed learning
Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

decentralized SGD
high-probability generalization
pointwise uniform stability
non-convex optimization
time-varying networks