An Accelerated Distributed Stochastic Gradient Method with Momentum

📅 2024-02-15
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
📄 PDF
🤖 AI Summary
This paper addresses distributed optimization over connected networks, where agents collaboratively minimize the average of smooth local objective functions. To this end, the authors propose the Distributed Stochastic Momentum Tracking (DSMT) algorithm, a single-loop method that integrates Loopless Chebyshev Acceleration (LCA) into the momentum tracking framework for the first time, eliminating the need for gradient accumulation or multi-round communication per iteration. Theoretically, under a general variance condition on the stochastic gradients, DSMT asymptotically matches the convergence rate of centralized SGD. Its transient times are the shortest known under this setting: $\mathcal{O}\big(n^{5/3}/(1-\lambda)\big)$ for general smooth objectives, improving to $\mathcal{O}\big(\sqrt{n/(1-\lambda)}\big)$ under the Polyak-Łojasiewicz (PL) condition, where $1-\lambda$ denotes the spectral gap of the network mixing matrix. DSMT is computationally efficient, structurally simple, and adapts to the network topology.
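The summary describes the algorithmic structure only at a high level. Below is a minimal NumPy sketch of a momentum-tracking update of the kind DSMT builds on: each agent keeps a momentum buffer and a tracking variable that estimates the network-average momentum. The ring topology, quadratic objectives, and all step-size/momentum constants here are illustrative assumptions; the paper's exact DSMT recursion (including the LCA mixing step) differs.

```python
import numpy as np

# Illustrative setup (ring topology and quadratic objectives are assumptions,
# not taken from the paper): n agents, local f_i(x) = 0.5 * ||x - b_i||^2,
# so the global minimizer is the mean of the b_i.
rng = np.random.default_rng(0)
n, d = 10, 5
b = rng.normal(size=(n, d))
x_star = b.mean(axis=0)

# Doubly stochastic mixing matrix W for a ring.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

alpha, beta, sigma = 0.05, 0.9, 0.1             # step size, momentum, noise
x = np.zeros((n, d))                            # row i = agent i's iterate
g = (x - b) + sigma * rng.normal(size=(n, d))   # stochastic local gradients
m = g.copy()                                    # local momentum buffers
y = m.copy()                                    # trackers of average momentum

for _ in range(500):
    x = W @ x - alpha * y                       # gossip + descend along tracker
    g = (x - b) + sigma * rng.normal(size=(n, d))
    m_new = beta * m + (1 - beta) * g           # local momentum update
    y = W @ y + m_new - m                       # momentum tracking correction
    m = m_new

print("distance of average iterate to optimum:",
      np.linalg.norm(x.mean(axis=0) - x_star))
```

The tracking update preserves the invariant that the trackers sum to the sum of the momentum buffers, which is what lets each agent descend along an estimate of the global (network-average) momentum direction.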

📝 Abstract
In this paper, we introduce an accelerated distributed stochastic gradient method with momentum for solving the distributed optimization problem, where a group of $n$ agents collaboratively minimize the average of the local objective functions over a connected network. The method, termed "Distributed Stochastic Momentum Tracking (DSMT)", is a single-loop algorithm that utilizes the momentum tracking technique as well as the Loopless Chebyshev Acceleration (LCA) method. We show that DSMT can asymptotically achieve convergence rates comparable to those of the centralized stochastic gradient descent (SGD) method under a general variance condition on the stochastic gradients. Moreover, the number of iterations (transient times) required for DSMT to achieve such rates behaves as $\mathcal{O}(n^{5/3}/(1-\lambda))$ for minimizing general smooth objective functions, and $\mathcal{O}(\sqrt{n/(1-\lambda)})$ under the Polyak-Łojasiewicz (PL) condition. Here, the term $1-\lambda$ denotes the spectral gap of the mixing matrix related to the underlying network topology. Notably, the obtained results do not rely on multiple inter-node communications or stochastic gradient accumulation per iteration, and the transient times are the shortest under this setting to the best of our knowledge.
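For reference, the problem and the reported transient-time bounds from the abstract can be written compactly as follows (the dimension symbol $p$ is an assumption; the abstract does not fix notation for it):

```latex
% Distributed optimization over a connected network of n agents:
\min_{x \in \mathbb{R}^p} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x)

% Transient times: iterations before DSMT matches centralized SGD rates.
K_{\text{smooth}} = \mathcal{O}\!\left(\frac{n^{5/3}}{1-\lambda}\right),
\qquad
K_{\text{PL}} = \mathcal{O}\!\left(\sqrt{\frac{n}{1-\lambda}}\right),
\qquad 1-\lambda = \text{spectral gap of the mixing matrix.}
```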
Problem

Research questions and friction points this paper is trying to address.

Distributed optimization with multiple agents minimizing local objectives
Achieving centralized SGD convergence rates in distributed settings
Reducing transient times for convergence under network topology constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributed Stochastic Momentum Tracking (DSMT), combining momentum tracking with accelerated communication
Loopless Chebyshev Acceleration (LCA) for single-step accelerated mixing (see the sketch after this list)
Single-loop algorithm structure, with one communication round per iteration
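Loopless Chebyshev Acceleration replaces the usual inner loop of Chebyshev-accelerated communication rounds with a single-step momentum recursion on the mixing variables. The sketch below shows this idea for plain average consensus; the coefficient `eta` follows the standard Chebyshev/heavy-ball choice and is an assumption here, since the paper's exact LCA coefficients and their coupling with the optimization updates may differ.

```python
import numpy as np

# Minimal sketch of loopless (single-step) Chebyshev-style accelerated mixing
# for average consensus on a ring; parameters are illustrative assumptions.
rng = np.random.default_rng(1)
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

lam = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]  # spectral gap is 1 - lam
eta = (1 - np.sqrt(1 - lam**2)) / (1 + np.sqrt(1 - lam**2))  # momentum weight

z = rng.normal(size=n)     # local values to be averaged
z_prev = z.copy()
avg = z.mean()

for _ in range(50):
    # One multiplication by W per iteration: no inner Chebyshev loop.
    z, z_prev = (1 + eta) * (W @ z) - eta * z_prev, z

print("consensus error:", np.linalg.norm(z - avg))
```

Because each iteration uses exactly one multiplication by W, the acceleration adds no extra communication per step, which is what makes the scheme "loopless".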
Kun Huang
The Chinese University of Hong Kong, Shenzhen, School of Data Science (SDS), Shenzhen, Guangdong, China
Shi Pu
China Telecom Guizhou Branch
Angelia Nedić
Arizona State University, School of Electrical, Computer and Energy Engineering, Tempe, AZ, United States