Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts

📅 2025-02-27
🤖 AI Summary
In large-scale Mixture-of-Experts (MoE) model training, inter-device communication in MoE layers can account for 47% of total execution time, and existing coarse-grained compute-communication overlap schemes suffer from reduced computational efficiency and incomplete latency hiding. To address this, the paper presents COMET, a fine-grained communication-computation overlapping framework driven by data-dependency analysis. COMET combines dependency analysis, task-level rescheduling, and adaptive workload assignment to coordinate communication and computation precisely, eliminating fine-grained communication bottlenecks. Evaluated on real-world MoE workloads, COMET accelerates a single MoE layer by 1.96× and delivers a 1.71× average end-to-end speedup across diverse scenarios, demonstrating strong cross-scenario generalization. The system has been deployed at scale on clusters with roughly ten thousand GPUs, saving millions of GPU-hours.

📝 Abstract
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. However, training large MoE models in distributed settings incurs substantial communication overhead: the inter-device communication of a MoE layer can occupy 47% of the total execution time with popular models and frameworks. Existing methods therefore pipeline the communication in a MoE layer with the computation for overlapping. However, these coarse-grained overlapping schemes notably impair computational efficiency, and the latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$, and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in production clusters with ten-thousand-scale GPUs, achieving savings of millions of GPU hours.
Problem

Research questions and friction points this paper is trying to address.

Reduces communication overhead in MoE models
Enhances communication-computation overlapping efficiency
Optimizes MoE system for distributed large-scale GPUs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained communication-computation overlapping
Data dependency analysis
Adaptive workload assignment
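The core idea behind fine-grained overlapping can be sketched in plain Python (an illustrative toy, not COMET's actual GPU implementation; the function names and chunk sizes here are invented for the sketch): instead of completing all token communication before any expert computation starts, the data is split into chunks so that the transfer of the next chunk overlaps with the computation on the current one.

```python
import queue
import threading
import time

def communicate(chunk):
    """Stand-in for an inter-device transfer (e.g. all-to-all dispatch)."""
    time.sleep(0.01)  # simulated network latency
    return chunk

def compute(chunk):
    """Stand-in for per-expert computation on received tokens."""
    return [x * 2 for x in chunk]

def coarse_grained(chunks):
    # Coarse scheme: all communication finishes, then all computation runs.
    received = [communicate(c) for c in chunks]
    return [compute(c) for c in received]

def fine_grained(chunks):
    # Fine scheme: a producer thread streams chunks as they arrive, so
    # communication of chunk i overlaps with computation on chunk i-1.
    q = queue.Queue()

    def producer():
        for c in chunks:
            q.put(communicate(c))
        q.put(None)  # sentinel: no more chunks

    threading.Thread(target=producer, daemon=True).start()

    out = []
    while (c := q.get()) is not None:
        out.append(compute(c))
    return out

chunks = [[1, 2], [3, 4], [5, 6]]
assert fine_grained(chunks) == coarse_grained(chunks)  # same result, less idle time
```

The two schedules produce identical results; the fine-grained one simply keeps the "compute" side busy while later chunks are still in flight, which is the latency-hiding effect the paper pursues at the kernel level.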