AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scalability limitations of data and pipeline parallelism in distributed training, which stem from high communication overhead and reliance on high-speed interconnects. To overcome these challenges, the paper introduces a fully asynchronous optimization framework that unifies both parallel paradigms for the first time. The approach mitigates gradient staleness through weight prediction and incorporates an asynchronous sparse averaging strategy with exponential moving average correction to ensure convergence while relaxing device colocation requirements. Experimental results demonstrate that the proposed method achieves training performance comparable to fully synchronous baselines on billion-parameter language models, while substantially reducing communication costs.

📝 Abstract
Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.
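The weight look-ahead idea from the abstract can be sketched in a few lines: a pipeline stage whose weights are `delay` optimizer steps stale extrapolates along its momentum estimate to approximate the weights that will be current when its gradient is applied. The function name, the linear predictor, and all parameter values below are illustrative assumptions; the paper's exact predictor may differ.

```python
import numpy as np

def lookahead_weights(w, momentum, lr, steps_ahead):
    """Predict the weights `steps_ahead` optimizer steps into the
    future by extrapolating along the momentum direction.

    Hypothetical sketch: assumes SGD-with-momentum-style updates of
    roughly -lr * momentum per step; the paper's predictor may differ.
    """
    return w - lr * steps_ahead * momentum

# A stage that is 2 steps stale computes its forward/backward pass
# against the predicted weights instead of the stale copy, reducing
# the effective gradient staleness.
w = np.array([1.0, 2.0])          # stale local weights
m = np.array([0.5, -0.5])         # running momentum estimate
w_pred = lookahead_weights(w, m, lr=0.1, steps_ahead=2)
```

The prediction costs no communication: each stage already holds its own momentum buffer, so the look-ahead is a purely local extrapolation.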
Problem

Research questions and friction points this paper is trying to address.

data parallelism
pipeline parallelism
communication bottleneck
distributed training
scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous optimization
pipeline parallelism
data parallelism
sparse averaging
weight look-ahead
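The two data-parallel ingredients listed above, sparse averaging and an EMA-based correction, can be sketched as follows. Each replica averages only a random fraction of coordinates with a peer (cutting communication proportionally) and is then nudged toward an exponential moving average of its own iterates to damp the drift that partial, stale averaging introduces. All names, the 50/50 averaging rule, and the correction form are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def sparse_average(local, peer, frac, rng):
    """Average a random subset (fraction `frac`) of coordinates with a
    peer's copy; the remaining coordinates stay local, so only ~frac of
    the parameters are communicated."""
    mask = rng.random(local.shape) < frac
    out = local.copy()
    out[mask] = 0.5 * (local[mask] + peer[mask])
    return out

def ema_correct(local, ema, beta=0.9, strength=0.1):
    """Update an exponential moving average of the local iterates and
    pull the weights slightly toward it (hypothetical correction form)."""
    ema = beta * ema + (1.0 - beta) * local
    corrected = local + strength * (ema - local)
    return corrected, ema

# One asynchronous round on a single replica (toy dimensions):
rng = np.random.default_rng(0)
local = np.ones(4)
peer = np.full(4, 3.0)            # a possibly stale peer copy
ema = np.zeros(4)
local = sparse_average(local, peer, frac=0.5, rng=rng)
local, ema = ema_correct(local, ema)
```

Because each round touches only a sparse coordinate subset and never blocks on other replicas, this style of averaging tolerates slow links between non-co-located devices, which is the setting the abstract targets.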