🤖 AI Summary
In large-scale data-parallel training, frequent global communication limits scalability and robustness. To address this, we propose Pseudo-Asynchronous Local SGD (PALSGD), an algorithm that lengthens synchronization intervals through a pseudo-synchronization mechanism, reducing communication frequency while preserving model consistency and relaxing the strict synchronization requirements of conventional Local SGD and DiLoCo. Theoretically, we provide the first rigorous convergence proof and explicit convergence rate analysis for this class of algorithms. Technically, PALSGD integrates gradient delay compensation, dynamic synchronization scheduling, and consistency-preserving mechanisms. Experiments on ImageNet-1K and TinyStories show that PALSGD trains up to 24.4% faster than DDP with no accuracy degradation, improving training efficiency and scalability.
📝 Abstract
Following AI scaling trends, frontier models continue to grow in size and to be trained on larger datasets. Training these models requires huge investments in exascale computational resources, which has in turn driven the development of distributed deep learning methods. Data parallelism is an essential approach to speeding up training, but it requires frequent global communication between workers, which can bottleneck training at the largest scales. In this work, we propose a method called Pseudo-Asynchronous Local SGD (PALSGD) to improve the efficiency of data-parallel training. PALSGD is an extension of Local SGD (Stich, 2018) and DiLoCo (Douillard et al., 2023), designed to further reduce communication frequency by introducing a pseudo-synchronization mechanism. PALSGD allows the use of longer synchronization intervals than standard Local SGD. Despite the reduced communication frequency, the pseudo-synchronization approach maintains model consistency, yielding performance comparable to that achieved with more frequent synchronization. Furthermore, we provide a theoretical analysis of PALSGD, establishing its convergence and deriving its convergence rate. This analysis offers insights into the algorithm's behavior and performance guarantees. We evaluate PALSGD on image classification and language modeling tasks. Our results show that PALSGD achieves better performance in less time than existing methods such as Distributed Data Parallel (DDP) and DiLoCo. Notably, PALSGD trains 18.4% faster than DDP on ImageNet-1K with ResNet-50, 24.4% faster than DDP on TinyStories with GPT-Neo-125M, and 21.1% faster than DDP on TinyStories with GPT-Neo-8M.
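To make the notion of a synchronization interval concrete, the following is a minimal sketch of plain Local SGD (the baseline that PALSGD extends), not of PALSGD itself: the abstract does not specify the pseudo-synchronization rule, so it is deliberately left out. The toy objective, the function name `local_sgd`, and all parameter values are assumptions chosen for illustration only.

```python
import random


def local_sgd(num_workers=4, sync_interval=8, rounds=25, lr=0.05, seed=0):
    """Toy Local SGD: worker k minimizes f_k(x) = 0.5 * (x - t_k)**2.

    Illustrative assumption, not the paper's setup: one scalar parameter,
    simulated workers, and Gaussian noise standing in for minibatch noise.
    """
    rng = random.Random(seed)
    # Hypothetical per-worker targets; the global optimum is their mean.
    targets = [rng.uniform(-1.0, 1.0) for _ in range(num_workers)]
    optimum = sum(targets) / num_workers

    x_global = 5.0   # shared model, deliberately far from the optimum
    comm_rounds = 0  # number of global averaging steps performed

    for _ in range(rounds):
        # Every worker starts the round from the shared model.
        local = [x_global] * num_workers
        for k in range(num_workers):
            # sync_interval local steps with no communication at all.
            for _ in range(sync_interval):
                grad = (local[k] - targets[k]) + rng.gauss(0.0, 0.01)
                local[k] -= lr * grad
        # One global synchronization: average the local models.
        x_global = sum(local) / num_workers
        comm_rounds += 1

    return x_global, optimum, comm_rounds


if __name__ == "__main__":
    x, opt, comms = local_sgd()
    print(f"final x = {x:.4f}, optimum = {opt:.4f}, communications = {comms}")
```

With these assumed defaults, each worker takes 200 local steps in total but the models are averaged only 25 times, whereas per-step gradient averaging in the style of DDP would communicate at all 200 steps. Lengthening `sync_interval` further is the lever that, according to the abstract, PALSGD's pseudo-synchronization is designed to make usable without losing model consistency.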