AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability limitations of data and pipeline parallelism in distributed training, which stem from high communication overhead and reliance on high-speed interconnects. To overcome these challenges, the paper introduces a fully asynchronous optimization framework that unifies both parallel paradigms for the first time. The approach mitigates gradient staleness through weight prediction and incorporates an asynchronous sparse averaging strategy with exponential moving average correction to ensure convergence while relaxing device colocation requirements. Experimental results demonstrate that the proposed method achieves training performance comparable to fully synchronous baselines on billion-parameter language models, while substantially reducing communication costs.

📝 Abstract
Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
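The weight look-ahead idea from the abstract can be sketched in a few lines: a pipeline stage whose weights are `delay` optimizer steps stale extrapolates along its momentum estimate to approximate the weights that will be current when its gradient is applied. The function name, the linear predictor, and all parameter values below are illustrative assumptions; the paper's exact predictor may differ.

```python
import numpy as np

def lookahead_weights(w, momentum, lr, steps_ahead):
    """Predict the weights `steps_ahead` optimizer steps into the
    future by extrapolating along the momentum direction.

    Hypothetical sketch: assumes SGD-with-momentum-style updates of
    roughly -lr * momentum per step; the paper's predictor may differ.
    """
    return w - lr * steps_ahead * momentum

# A stage that is 2 steps stale computes its forward/backward pass
# against the predicted weights instead of the stale copy, reducing
# the effective gradient staleness.
w = np.array([1.0, 2.0])          # stale local weights
m = np.array([0.5, -0.5])         # running momentum estimate
w_pred = lookahead_weights(w, m, lr=0.1, steps_ahead=2)
```

The prediction costs no communication: each stage already holds its own momentum buffer, so the look-ahead is a purely local extrapolation.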
Problem

Research questions and friction points this paper is trying to address.

data parallelism
pipeline parallelism
communication bottleneck
distributed training
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous optimization
pipeline parallelism
data parallelism
sparse averaging
weight look-ahead
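The two data-parallel ingredients listed above, sparse averaging and an EMA-based correction, can be sketched as follows. Each replica averages only a random fraction of coordinates with a peer (cutting communication proportionally) and is then nudged toward an exponential moving average of its own iterates to damp the drift that partial, stale averaging introduces. All names, the 50/50 averaging rule, and the correction form are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def sparse_average(local, peer, frac, rng):
    """Average a random subset (fraction `frac`) of coordinates with a
    peer's copy; the remaining coordinates stay local, so only ~frac of
    the parameters are communicated."""
    mask = rng.random(local.shape) < frac
    out = local.copy()
    out[mask] = 0.5 * (local[mask] + peer[mask])
    return out

def ema_correct(local, ema, beta=0.9, strength=0.1):
    """Update an exponential moving average of the local iterates and
    pull the weights slightly toward it (hypothetical correction form)."""
    ema = beta * ema + (1.0 - beta) * local
    corrected = local + strength * (ema - local)
    return corrected, ema

# One asynchronous round on a single replica (toy dimensions):
rng = np.random.default_rng(0)
local = np.ones(4)
peer = np.full(4, 3.0)            # a possibly stale peer copy
ema = np.zeros(4)
local = sparse_average(local, peer, frac=0.5, rng=rng)
local, ema = ema_correct(local, ema)
```

Because each round touches only a sparse coordinate subset and never blocks on other replicas, this style of averaging tolerates slow links between non-co-located devices, which is the setting the abstract targets.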