Unveiling High-Probability Generalization in Decentralized SGD

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of high-probability generalization guarantees for decentralized stochastic gradient descent (D-SGD) comparable to those of classical SGD. Using pointwise uniform stability, a weaker notion than uniform stability, the paper develops high-probability generalization bounds for D-SGD that target the rate $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/δ)\right)$ in convex, strongly convex, and non-convex settings, where $m$ is the number of workers, $n$ the sample size, and $δ$ the confidence parameter. For non-convex problems where only local minima exist, it gives high-probability results for gradient-based measures, and it also derives optimization error and excess risk bounds. To account for communication overhead, the analysis extends to local models in time-varying frameworks. The results indicate that D-SGD retains its generalization guarantees under communication constraints and time-varying network topologies.
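For readers unfamiliar with the algorithm, the following is a minimal sketch of one common form of D-SGD, not taken from the paper: each worker averages parameters with its neighbours through a doubly stochastic mixing matrix and subtracts a local stochastic gradient. The ring topology, the least-squares loss, the step size, and the names `ring_mixing_matrix` and `dsgd` are all illustrative assumptions.

```python
import numpy as np


def ring_mixing_matrix(m):
    """Doubly stochastic mixing matrix for a ring of m workers.

    Each worker gives weight 1/3 to itself and to each of its two
    ring neighbours (an illustrative choice, not the paper's).
    """
    W = np.zeros((m, m))
    for i in range(m):
        # += keeps rows summing to 1 even when neighbours coincide (m <= 2).
        W[i, i] += 1 / 3
        W[i, (i - 1) % m] += 1 / 3
        W[i, (i + 1) % m] += 1 / 3
    return W


def dsgd(X, y, m, steps=200, lr=0.05, seed=0):
    """Run a simple D-SGD loop on a least-squares problem.

    The rows of X (and entries of y) are split across m workers.
    Returns an (m, d) array holding one parameter vector per worker.
    """
    rng = np.random.default_rng(seed)
    n_total, d = X.shape
    shards = np.array_split(np.arange(n_total), m)  # local datasets
    W = ring_mixing_matrix(m)
    params = np.zeros((m, d))
    for _ in range(steps):
        grads = np.zeros_like(params)
        for i, idx in enumerate(shards):
            j = rng.choice(idx)  # one local sample per worker per step
            residual = X[j] @ params[i] - y[j]
            grads[i] = residual * X[j]  # gradient of 0.5 * residual**2
        # Gossip averaging with neighbours, then the local gradient step.
        params = W @ params - lr * grads
    return params


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(400, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    print(dsgd(X, y, m=4, steps=2000))
```

In this sketch the mixing matrix is fixed. A time-varying network would correspond to using a different mixing matrix at each step, and local updates to taking several gradient steps between mixing rounds.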

📝 Abstract

Decentralized stochastic gradient descent (D-SGD) is an efficient method for large-scale distributed learning. Existing generalization studies mainly address expected results, achieving rates limited to $\mathcal{O}\left(\frac{1}{δ\sqrt{mn}}\right)$, where $δ$ is the confidence parameter, $m$ the number of workers, and $n$ the sample size. When $m=1$, D-SGD reduces to traditional SGD, whose optimal high-probability generalization bound is $\mathcal{O}\left(\frac{1}{\sqrt{n}}\log (1/δ)\right)$. This discrepancy reveals a gap between high-probability guarantees for SGD and those for D-SGD. To close this gap, we develop a high-probability learning theory for D-SGD, aiming for the optimal $\mathcal{O}\left(\frac{1}{\sqrt{mn}}\log (1/δ)\right)$ rate. We refine bounds for D-SGD using pointwise uniform stability in distributed learning, a weaker notion than uniform stability, and analyze them across convex, strongly convex, and non-convex settings. We also provide high-probability results for gradient-based measures in non-convex cases where only local minima exist, and derive optimization error and excess risk bounds. Finally, accounting for communication overhead, we analyze generalization bounds for local models within time-varying frameworks.
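To make the gap between the two rates concrete, the sketch below evaluates both expressions from the abstract with all hidden constants set to 1. That constant choice, the sample values of $m$, $n$, and $δ$, and the function names `polynomial_delta_rate` and `log_delta_rate` are illustrative assumptions; the absolute numbers are not actual bounds, only the relative scaling in $δ$ is meaningful.

```python
import math


def polynomial_delta_rate(m, n, delta):
    """1 / (delta * sqrt(m * n)): the rate with polynomial dependence on 1/delta."""
    return 1.0 / (delta * math.sqrt(m * n))


def log_delta_rate(m, n, delta):
    """log(1/delta) / sqrt(m * n): the targeted high-probability rate."""
    return math.log(1.0 / delta) / math.sqrt(m * n)


if __name__ == "__main__":
    m, n = 10, 1000  # hypothetical number of workers and sample size
    for delta in (0.1, 0.01, 0.001):
        print(
            f"delta={delta}: "
            f"polynomial={polynomial_delta_rate(m, n, delta):.4f}, "
            f"logarithmic={log_delta_rate(m, n, delta):.4f}"
        )
```

With $m = 10$, $n = 1000$, and $δ = 0.01$, the first expression equals 1 while the second is about 0.046. The first grows linearly in $1/δ$ and the second only logarithmically, so the gap widens as the required confidence increases.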

Problem

Research questions and friction points this paper is trying to address.

Decentralized SGD

Generalization bound

High-probability guarantee

Distributed learning

Stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized SGD

High-probability generalization

Pointwise uniform stability