FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Communication bottlenecks severely limit training efficiency of Mixture-of-Experts (MoE) models in distributed settings. Method: This paper proposes an architecture-driven communication-computation overlap technique: it reconstructs the MoE structure across all network layers, introduces cross-layer skip connections to decouple expert routing from feed-forward computation, and integrates self-distillation for lossless knowledge transfer. The method is framework-agnostic—compatible with mainstream deep learning libraries—and supports models ranging from 16B to 109B parameters (e.g., Llama 4 Scout), requiring no modifications to underlying communication libraries. Contribution/Results: It extends the applicability of skip connections to MoE architectures for the first time and establishes a scalable system-algorithm co-optimization paradigm for efficient distributed MoE training. Experiments show up to 2.3× higher training throughput and 37% lower inference latency, with average downstream task accuracy degradation under 1%.

📝 Abstract
Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the skip connections in the model, and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and when modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release, averaged across a wide range of downstream evaluations. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
Problem

Research questions and friction points this paper is trying to address.

Overcoming blocking communication in distributed Mixture of Experts models
Maintaining model accuracy while modifying the skip-connection architecture
Enabling computation-communication overlap in large-scale MoE implementations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Skip connections that allow communication to overlap with computation
Converting models ranging from 16B to 109B parameters via self-distillation
Achieving accuracy within 1% of the original open-source releases
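The core idea above can be illustrated with a small, self-contained sketch. This is not the paper's implementation: it simulates the blocking expert collective with a sleep and uses a thread pool to show how computation on a skip branch, which does not depend on the collective's result, can run while "communication" is in flight. All names (`simulated_all_to_all`, `skip_branch`, `moe_layer_overlapped`) are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_all_to_all(tokens):
    # Stand-in for the blocking expert dispatch/combine collective;
    # the sleep models time spent on the wire.
    time.sleep(0.05)
    return [t * 2 for t in tokens]  # pretend the experts doubled each token

def skip_branch(tokens):
    # Computation on the skip connection. It does not consume the
    # collective's output, so it can proceed while communication runs.
    time.sleep(0.05)
    return [t + 1 for t in tokens]

def moe_layer_overlapped(tokens):
    with ThreadPoolExecutor(max_workers=1) as pool:
        comm = pool.submit(simulated_all_to_all, tokens)  # launch, non-blocking
        skip = skip_branch(tokens)                        # overlaps with comm
        expert_out = comm.result()                        # single sync point
    # Residual-style merge of the expert output and the skip branch.
    return [e + s for e, s in zip(expert_out, skip)]

print(moe_layer_overlapped([1, 2, 3]))  # [4, 7, 10]
```

Run sequentially, the two 0.05 s phases would take ~0.10 s; overlapped, the layer finishes in roughly the duration of the slower phase. In a real framework the same structure would pair an asynchronous collective (e.g. a non-blocking all-to-all) with the skip-branch kernels between launch and synchronization.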