Convergence of Distributed Adaptive Optimization with Local Updates

📅 2024-09-20
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the theoretical benefits of local updates (i.e., intermittent communication) in distributed adaptive optimization, focusing on communication complexity reduction. We propose Local SGDM and Local Adam—distributed variants of SGDM and Adam with local gradient steps—and establish their first convergence guarantees for both convex and weakly convex objectives. Under appropriate parameter configurations, both algorithms provably outperform their minibatch counterparts by substantially reducing the required number of communication rounds. To overcome theoretical barriers posed by generalized smoothness and gradient clipping, we introduce a novel *local iteration contraction analysis*, enabling tight convergence rates in the weakly convex regime. This yields the first rigorous guarantee of convergence superiority for Local SGDM and Local Adam under weak convexity. Our results provide a solid theoretical foundation for efficient, low-communication distributed deep learning.
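To make the algorithmic idea concrete, here is a minimal NumPy sketch of the Local SGDM pattern described above: each worker takes several SGD-with-momentum steps on noisy gradients, and iterates (and momentum buffers) are averaged only at each communication round. All function names and hyperparameters are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def local_sgdm(grad, x0, workers=4, rounds=10, local_steps=8,
               lr=0.1, beta=0.9, noise=0.01, seed=0):
    """Hypothetical sketch of Local SGDM (intermittent communication).

    Each worker runs `local_steps` momentum-SGD updates on stochastic
    gradients; iterates and momentum buffers are averaged once per round.
    """
    rng = np.random.default_rng(seed)
    x = np.tile(np.asarray(x0, dtype=float), (workers, 1))  # per-worker iterates
    m = np.zeros_like(x)                                    # per-worker momentum
    for _ in range(rounds):
        for _ in range(local_steps):
            for w in range(workers):
                g = grad(x[w]) + noise * rng.standard_normal(x[w].shape)
                m[w] = beta * m[w] + (1 - beta) * g         # momentum update
                x[w] -= lr * m[w]                           # local step
        # communication round: average iterates and momentum across workers
        x[:] = x.mean(axis=0)
        m[:] = m.mean(axis=0)
    return x[0]

# toy convex objective f(x) = ||x||^2 / 2, so grad(x) = x and minimizer is 0
sol = local_sgdm(lambda x: x, np.ones(5))
```

The communication saving is visible in the structure: with `rounds * local_steps` total gradient steps, only `rounds` synchronizations occur, versus one per step for the minibatch counterpart.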

📝 Abstract
We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not yet been fully understood. In this paper, for the first time, we prove that *Local SGD* with momentum (*Local SGDM*) and *Local Adam* can outperform their minibatch counterparts in convex and weakly convex settings, respectively, in certain regimes. Our analysis relies on a novel technique for proving contraction during local iterations, a crucial yet challenging step in showing the advantages of local updates, under a generalized smoothness assumption and a gradient clipping strategy.
Problem

Research questions and friction points this paper is trying to address.

Explores distributed adaptive optimization with local updates (intermittent communication).
Analyzes how local updates reduce communication complexity in adaptive methods.
Proves Local SGDM and Local Adam can outperform their minibatch counterparts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Local SGD with momentum (Local SGDM)
Local Adam with gradient clipping
Local iteration contraction analysis yielding reduced communication complexity
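The second contribution, Local Adam with gradient clipping, can be sketched in the same style: workers take Adam steps on clipped stochastic gradients locally, and both iterates and moment estimates are averaged at each communication round. Again, this is an illustrative sketch under assumed hyperparameters, not the paper's exact algorithm.

```python
import numpy as np

def local_adam(grad, x0, workers=4, rounds=10, local_steps=8,
               lr=0.05, b1=0.9, b2=0.999, eps=1e-8, clip=1.0,
               noise=0.01, seed=0):
    """Hypothetical sketch of Local Adam with gradient clipping."""
    rng = np.random.default_rng(seed)
    x = np.tile(np.asarray(x0, dtype=float), (workers, 1))
    m = np.zeros_like(x)   # first-moment estimates, per worker
    v = np.zeros_like(x)   # second-moment estimates, per worker
    t = 0
    for _ in range(rounds):
        for _ in range(local_steps):
            t += 1
            for w in range(workers):
                g = grad(x[w]) + noise * rng.standard_normal(x[w].shape)
                gnorm = np.linalg.norm(g)
                if gnorm > clip:               # gradient clipping
                    g = g * (clip / gnorm)
                m[w] = b1 * m[w] + (1 - b1) * g
                v[w] = b2 * v[w] + (1 - b2) * g**2
                mhat = m[w] / (1 - b1**t)      # bias correction
                vhat = v[w] / (1 - b2**t)
                x[w] -= lr * mhat / (np.sqrt(vhat) + eps)
        # communication round: average iterates and moment estimates
        x[:] = x.mean(axis=0)
        m[:] = m.mean(axis=0)
        v[:] = v.mean(axis=0)
    return x[0]

# toy convex objective with grad(x) = x; minimizer is the origin
sol = local_adam(lambda x: x, np.ones(3))
```

Averaging the moment estimates alongside the iterates is one plausible design choice for keeping worker states synchronized; the paper's contraction analysis addresses precisely why such local states stay close between communication rounds.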
Ziheng Cheng
UC Berkeley
Machine Learning · Optimization · Statistics
Margalit Glasgow
Massachusetts Institute of Technology