AI Summary
This work addresses the inefficiency of existing Mixture-of-Experts (MoE) training frameworks on Ascend NPUs, which execute operators sequentially and fail to exploit the parallelism of the chip's heterogeneous compute units: AI Cores (AIC) and AI Vector cores (AIV). To overcome this limitation, we propose HyperParallel-MoE, a novel approach that reformulates MoE operators into block-level heterogeneous task streams spanning AIC and AIV. By leveraging a unified runtime, our method concurrently schedules both compute resources within a single kernel launch, enabling fine-grained overlap of communication, matrix, and vector operations. Key innovations include AIV-driven one-sided communication to eliminate host synchronization, a dependency-preserving block-task abstraction that unifies computation and communication, and event-driven static scheduling to reduce queue coordination overhead. HyperParallel-MoE is implemented atop MindSpore and MindFormers, and evaluations on an Ascend A3 cluster demonstrate up to a 1.58× reduction in Dispatch-to-Combine latency, significantly enhancing MoE training efficiency.
Abstract
Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized.
This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.
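To make the idea of a statically scheduled tile-level taskflow more concrete, the following is a minimal illustrative sketch, not the HyperParallel-MoE implementation. It assumes a simplified per-tile chain (dispatch, GEMM, activation, combine), two queues named "AIC" and "AIV", made-up task costs in arbitrary time units, and a plain greedy list scheduler standing in for whatever scheduling algorithm the paper actually uses. All names (`TileTask`, `build_moe_tiles`, `static_schedule`) are hypothetical. The sketch computes a schedule ahead of time and lists the cross-queue dependencies that a runtime would need to guard with events.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TileTask:
    name: str          # e.g. "gemm[0]"
    queue: str         # "AIC" (matrix) or "AIV" (vector and communication)
    cost: int          # hypothetical duration, arbitrary time units
    deps: tuple = ()   # names of tasks that must finish first


def build_moe_tiles(num_tiles):
    """Build one dispatch -> GEMM -> activation -> combine chain per tile."""
    tasks = []
    for t in range(num_tiles):
        tasks += [
            TileTask(f"dispatch[{t}]", "AIV", 2),
            TileTask(f"gemm[{t}]", "AIC", 4, (f"dispatch[{t}]",)),
            TileTask(f"act[{t}]", "AIV", 1, (f"gemm[{t}]",)),
            TileTask(f"combine[{t}]", "AIV", 2, (f"act[{t}]",)),
        ]
    return tasks


def static_schedule(tasks):
    """Greedy list scheduling onto per-queue timelines, done ahead of time.

    Returns ({name: (start, end)}, events), where events lists every
    (producer, consumer) dependency that crosses queues.
    """
    by_name = {t.name: t for t in tasks}
    queue_free = defaultdict(int)   # time at which each queue becomes idle
    schedule = {}
    pending = list(tasks)

    def earliest_start(task):
        dep_ends = [schedule[d][1] for d in task.deps]
        return max([queue_free[task.queue]] + dep_ends)

    while pending:
        ready = [t for t in pending if all(d in schedule for d in t.deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # Pick the ready task that can start soonest (ties: input order).
        task = min(ready, key=earliest_start)
        start = earliest_start(task)
        schedule[task.name] = (start, start + task.cost)
        queue_free[task.queue] = start + task.cost
        pending.remove(task)

    # Same-queue dependencies are already ordered by the queue itself;
    # only cross-queue ones need an explicit event.
    events = [
        (d, t.name)
        for t in tasks
        for d in t.deps
        if by_name[d].queue != t.queue
    ]
    return schedule, events


def makespan(schedule):
    return max(end for _, end in schedule.values())


def serial_latency(tasks):
    """Latency if every task ran one after another, kernel by kernel."""
    return sum(t.cost for t in tasks)


if __name__ == "__main__":
    tasks = build_moe_tiles(4)
    schedule, events = static_schedule(tasks)
    print("serial:", serial_latency(tasks), "overlapped:", makespan(schedule))
    print("cross-queue events:", len(events))
```

In this toy model the overlap comes entirely from running the vector and communication work of one tile while the matrix unit processes another. The ratio between the two printed latencies depends on the invented costs and says nothing about the measured 1.58x result.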