HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs

📅 2026-05-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing Mixture-of-Experts (MoE) training frameworks on Ascend NPUs, which execute operators sequentially and fail to exploit the parallelism of the heterogeneous on-chip compute units: matrix-oriented AIC units and vector-oriented AIV units. To overcome this limitation, the paper proposes HyperParallel-MoE, which reformulates MoE operators into tile-level heterogeneous task streams spanning AIC and AIV. A unified runtime drives both compute resources concurrently within a single kernel launch, enabling fine-grained overlap of communication, matrix, and vector operations. Key innovations include AIV-driven one-sided communication that eliminates host-side collective synchronization, a dependency-preserving tile-task abstraction that unifies computation and communication, and event-driven static scheduling that coordinates cross-queue execution with low runtime overhead. The system is implemented in the MindSpore and MindFormers stack, and evaluations on Ascend A3 clusters show that it reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58×.
📝 Abstract
Modern Mixture-of-Experts (MoE) models increasingly rely on large-scale AI accelerator clusters for efficient training. Ascend NPUs expose heterogeneous on-chip compute resources, including matrix-oriented AIC units and vector-oriented AIV units with explicit cross-queue synchronization support. However, existing training frameworks largely execute MoE operators in a serialized kernel-by-kernel manner, leaving substantial heterogeneous parallelism underutilized. This paper presents HyperParallel-MoE, a compilation and scheduling framework for MoE training on Ascend NPUs. HyperParallel-MoE transforms operator-level MoE execution into a statically scheduled tile-level heterogeneous taskflow spanning AIC and AIV resources. It introduces AIV-driven one-sided communication to eliminate host-side collective synchronization, dependency-preserving tile task generation to unify communication and computation under a common task abstraction, and event-driven static scheduling to coordinate cross-queue execution with low runtime overhead. HyperParallel-MoE further executes the compiled taskflow within a unified runtime that concurrently drives AIC and AIV workers inside a single kernel launch, enabling fine-grained overlap among communication, matrix computation, and vector computation while preserving existing optimized operators. We implement HyperParallel-MoE in the MindSpore and MindFormers stack and evaluate it using DeepSeek-style MoE models on Ascend A3 clusters. Across multiple expert-parallel configurations, HyperParallel-MoE reduces Dispatch-to-Combine MoE-FFN latency by up to 1.58x, demonstrating that tile-level heterogeneous scheduling can substantially improve MoE training efficiency on modern NPUs.
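To make the scheduling idea in the abstract concrete, here is a minimal Python sketch of a toy model, not the paper's implementation. It compares kernel-by-kernel serialized execution with a dependency-preserving tile schedule over two queues. The task names (`dispatch`, `matmul`, `act`), the durations, and the greedy list-scheduling policy are all assumptions chosen for illustration; the real system compiles its schedule for Ascend AIC and AIV workers and its policy may differ.

```python
from collections import defaultdict

def make_tile_tasks(num_tiles):
    """Build a hypothetical per-tile chain: dispatch (AIV) -> matmul (AIC) -> act (AIV).

    Each entry maps name -> (queue, duration, dependencies). Durations are
    made-up time units, not measurements.
    """
    tasks = {}
    for i in range(num_tiles):
        tasks[f"dispatch{i}"] = ("AIV", 2, [])
        tasks[f"matmul{i}"] = ("AIC", 3, [f"dispatch{i}"])
        tasks[f"act{i}"] = ("AIV", 1, [f"matmul{i}"])
    return tasks

def serialized_latency(tasks):
    """Baseline: one task at a time, so latency is the sum of all durations."""
    return sum(duration for _, duration, _ in tasks.values())

def interleaved_schedule(tasks):
    """Greedy list scheduling over per-queue timelines.

    A task may start once its queue is free and every dependency has
    finished (the 'event' it waits on). Among ready tasks, the one with the
    earliest feasible start is placed next. The resulting per-queue order is
    fixed ahead of time, which is the sense of 'static' used in this sketch.
    """
    finish = {}                      # task name -> finish time
    queue_free = defaultdict(int)    # queue name -> time the queue becomes idle
    order = defaultdict(list)        # queue name -> tasks in issue order
    remaining = dict(tasks)
    while remaining:
        best = None
        for name, (queue, _, deps) in remaining.items():
            if all(d in finish for d in deps):
                start = max([queue_free[queue]] + [finish[d] for d in deps])
                if best is None or start < best[0]:
                    best = (start, name)
        start, name = best
        queue, duration, _ = remaining.pop(name)
        finish[name] = start + duration
        queue_free[queue] = finish[name]
        order[queue].append(name)
    return max(finish.values()), dict(order), finish

if __name__ == "__main__":
    tasks = make_tile_tasks(4)
    makespan, order, _ = interleaved_schedule(tasks)
    print("serialized latency:", serialized_latency(tasks))
    print("interleaved latency:", makespan)
    for queue, names in order.items():
        print(queue, "order:", names)
```

In this toy model the gain comes entirely from letting the vector queue work on later tiles while the matrix queue is busy, which is the kind of overlap the abstract describes. The numbers it prints say nothing about the reported 1.58x result.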
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
heterogeneous parallelism
Ascend NPUs
MoE training
underutilized compute resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Heterogeneous Scheduling
Tile-level Parallelism
Ascend NPU
Static Taskflow