HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of existing Mixture-of-Experts (MoE) training frameworks on Ascend NPUs, which execute MoE operators serially, kernel by kernel, and so fail to exploit the parallelism between the heterogeneous on-chip compute units: matrix-oriented AIC units and vector-oriented AIV units. The proposed framework, HyperParallel-MoE, recasts operator-level MoE execution as a statically scheduled, tile-level heterogeneous taskflow spanning AIC and AIV. A unified runtime drives both kinds of workers concurrently within a single kernel launch, enabling fine-grained overlap of communication, matrix computation, and vector computation. Key contributions are AIV-driven one-sided communication, which eliminates host-side collective synchronization; dependency-preserving tile task generation, which unifies computation and communication under one task abstraction; and event-driven static scheduling, which coordinates cross-queue execution with low runtime overhead. HyperParallel-MoE is implemented in the MindSpore and MindFormers stack; evaluated with DeepSeek-style MoE models on Ascend A3 clusters, it reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58×.

📝 Abstract

Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.
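
To make the idea of a statically scheduled, tile-level taskflow across two queues concrete, the following is a toy Python simulation. It is not the paper's compiler or runtime: the names (`TileTask`, `pipelined_order`, `simulate`), the three-stage dispatch/GEMM/combine split per tile, and the task costs are all hypothetical, chosen only to illustrate how in-order AIC and AIV queues gated by completion events can overlap work that a kernel-by-kernel execution would serialize.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    name: str
    queue: str        # "AIC" (matrix work) or "AIV" (vector work and communication)
    cost: int         # made-up duration in arbitrary time units
    deps: tuple = ()  # names of tasks whose completion events must fire first


def pipelined_order(num_tiles):
    """Emit tasks in a fixed, software-pipelined order (assumed for illustration).

    Per tile: dispatch (AIV) -> expert GEMM (AIC) -> combine (AIV).
    The combine of tile t-1 is issued after the dispatch of tile t, so the
    AIV queue has work to do while the AIC queue runs a GEMM.
    """
    order = []
    for t in range(num_tiles + 1):
        if t < num_tiles:
            order.append(TileTask(f"dispatch{t}", "AIV", 2))
            order.append(TileTask(f"gemm{t}", "AIC", 3, (f"dispatch{t}",)))
        if t >= 1:
            order.append(TileTask(f"combine{t - 1}", "AIV", 2, (f"gemm{t - 1}",)))
    return order


def simulate(tasks):
    """Replay the static order on two in-order queues; return the makespan.

    A task starts once its queue is free and every dependency has finished,
    which stands in for waiting on a cross-queue event.
    """
    queue_free = {"AIC": 0, "AIV": 0}
    finish = {}
    for task in tasks:
        ready = max((finish[d] for d in task.deps), default=0)
        start = max(queue_free[task.queue], ready)
        finish[task.name] = start + task.cost
        queue_free[task.queue] = finish[task.name]
    return max(finish.values())


def serialized(tasks):
    """Kernel-by-kernel baseline: no overlap, so the costs simply add up."""
    return sum(task.cost for task in tasks)


if __name__ == "__main__":
    flow = pipelined_order(4)
    print("overlapped:", simulate(flow), "serialized:", serialized(flow))
```

With these made-up costs, four tiles finish in 18 time units when the two queues overlap, against 28 when every task runs back to back. The figures are artifacts of the chosen costs and say nothing about the measured results; the point is only that the schedule is fixed ahead of time and the runtime merely waits on events.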

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

heterogeneous parallelism

Ascend NPUs

MoE training

underutilized compute resources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Heterogeneous Scheduling

Tile-level Parallelism