HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing

📅 2025-09-11
🤖 AI Summary
MoE large language models deployed on Near-Memory Processing (NMP) architectures suffer from high communication overhead and computational load imbalance, both aggravated by dynamic expert routing. To address this, the paper proposes HD-MoE, a hybrid and dynamic parallel optimization framework. It combines an automated offline hybrid mapping scheme, which jointly integrates tensor parallelism, expert parallelism, and NMP-aware fine-grained data placement, with an online dynamic scheduling strategy that co-optimizes communication and computation resources within 3D-stacked memory. Its key innovation lies in decoupling static memory mapping from runtime routing adaptation, which balances load and reduces communication cost under NMP constraints. Experiments show inference speedups of 1.1×–1.8× over tensor parallelism, 1.1×–1.5× over expert parallelism, and 1.0×–1.4× over the baseline compute-balanced hybrid TP-EP strategy.

📝 Abstract
Large Language Models (LLMs) with Mixture-of-Expert (MoE) architectures achieve superior model performance with reduced computation costs, but at the cost of high memory capacity and bandwidth requirements. Near-Memory Processing (NMP) accelerators that stack memory directly on the compute through hybrid bonding have demonstrated high bandwidth with high energy efficiency, becoming a promising architecture for MoE models. However, as NMP accelerators comprise distributed memory and computation, how to map the MoE computation directly determines the LLM inference efficiency. Existing parallel mapping strategies, including Tensor Parallelism (TP) and Expert Parallelism (EP), suffer from either high communication costs or unbalanced computation utilization, leading to inferior efficiency. The dynamic routing mechanism of MoE LLMs further aggravates the efficiency challenges. Therefore, in this paper, we propose HD-MoE to automatically optimize the MoE parallel computation across an NMP accelerator. HD-MoE features an offline automatic hybrid parallel mapping algorithm and an online dynamic scheduling strategy to reduce the communication costs while maximizing the computation utilization. With extensive experimental results, we demonstrate that HD-MoE achieves a speedup ranging from 1.1x to 1.8x over TP, 1.1x to 1.5x over EP, and 1.0x to 1.4x over the baseline Hybrid TP-EP with Compute-Balanced parallelism strategies.
Problem

Research questions and friction points this paper is trying to address.

Optimizing MoE LLM computation mapping on NMP accelerators
Reducing communication costs and balancing computation utilization
Addressing dynamic routing challenges in MoE inference efficiency
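
The load-balance side of this trade-off can be illustrated with a toy model. The sketch below is not from the paper: the function names, the power-law routing skew, and the round-robin expert placement are all assumptions made for illustration. It models compute load only; the extra communication that tensor parallelism needs to combine partial results is deliberately left out.

```python
import random

def route_tokens(num_tokens, num_experts, top_k, skew, seed=0):
    """Sample top-k distinct experts per token from a skewed popularity
    distribution (hypothetical stand-in for a learned MoE router)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(num_experts)]
    counts = [0] * num_experts
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:  # requires top_k <= num_experts
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        for expert in chosen:
            counts[expert] += 1
    return counts

def ep_device_loads(counts, num_devices):
    """Expert parallelism: each expert lives wholly on one device
    (round-robin placement, an assumption of this sketch)."""
    loads = [0.0] * num_devices
    for expert, count in enumerate(counts):
        loads[expert % num_devices] += count
    return loads

def tp_device_loads(counts, num_devices):
    """Tensor parallelism: every expert is sliced evenly across all devices,
    so each device sees the same share of the total work."""
    total = sum(counts)
    return [total / num_devices] * num_devices

def imbalance(loads):
    """Busiest device's load divided by the mean load (1.0 = balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

if __name__ == "__main__":
    counts = route_tokens(num_tokens=256, num_experts=8, top_k=2, skew=1.0)
    print("tokens per expert:", counts)
    print("EP imbalance:", imbalance(ep_device_loads(counts, 4)))
    print("TP imbalance:", imbalance(tp_device_loads(counts, 4)))
```

In this model TP is always perfectly balanced, while EP's imbalance grows with the routing skew. That is the compute-utilization half of the problem; the communication half is what TP pays in exchange.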
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid automatic parallel mapping algorithm
Online dynamic scheduling strategy optimization
3D near-memory processing accelerator utilization
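
To make the idea of a hybrid mapping concrete, here is a minimal sketch under stated assumptions. It is not HD-MoE's mapping algorithm, which this page does not describe in detail. The heuristic is a simple greedy one chosen for illustration: experts whose expected load exceeds a threshold are split across all devices (TP-style), and the rest are placed whole on the currently least-loaded device (EP-style). The name `hybrid_map`, the `split_threshold` parameter, and the example loads are all hypothetical.

```python
def hybrid_map(expected_loads, num_devices, split_threshold):
    """Greedy toy hybrid TP/EP placement (illustrative heuristic only).

    expected_loads: per-expert load estimates, e.g. from offline profiling.
    split_threshold: an expert is split across all devices when its load
        exceeds split_threshold times the ideal per-device load.
    Returns (placement, device_load), where placement maps each expert
    index to the list of devices holding it.
    """
    target = sum(expected_loads) / num_devices  # ideal per-device load
    device_load = [0.0] * num_devices
    placement = {}
    # Handle the heaviest experts first so lighter ones can fill the gaps.
    order = sorted(range(len(expected_loads)), key=lambda e: -expected_loads[e])
    for expert in order:
        load = expected_loads[expert]
        if load > split_threshold * target:
            # TP-style: slice the hot expert evenly across every device.
            for device in range(num_devices):
                device_load[device] += load / num_devices
            placement[expert] = list(range(num_devices))
        else:
            # EP-style: keep the expert whole on the least-loaded device.
            device = min(range(num_devices), key=lambda d: device_load[d])
            device_load[device] += load
            placement[expert] = [device]
    return placement, device_load

if __name__ == "__main__":
    loads = [100, 20, 20, 20, 20, 10, 10]  # one hot expert, six cooler ones
    placement, device_load = hybrid_map(loads, num_devices=4, split_threshold=1.0)
    print("placement:", placement)
    print("device load:", device_load)
```

Splitting only the hot experts limits the communication that full TP would incur while avoiding the hotspot that pure EP would create. A static mapping like this cannot react to routing that shifts at run time, which is the gap an online scheduling strategy is meant to close.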
Authors
Haochen Huang
University of California San Diego
system/software reliability, security
Shuzhang Zhong
Peking University
Machine Learning System
Zhe Zhang
DAMO Academy, Alibaba Group, Beijing, China; Hupan Lab, Hangzhou, China
Shuangchen Li
Research Scientist, DAMO Academy, Alibaba Group
Computer Architecture, Electronic Design Automation
Dimin Niu
Computing Technology Lab, Alibaba DAMO Academy
Computer Architecture, Memory Systems, Processing-in-Memory, Deep Learning
Hongzhong Zheng
DAMO Academy, Alibaba Group, Beijing, China; Hupan Lab, Hangzhou, China
Runsheng Wang
Institute of Electronic Design Automation, Peking University, Wuxi, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China
Meng Li
Institute for Artificial Intelligence, Peking University, Beijing, China; School of Integrated Circuits, Peking University, Beijing, China; Beijing Advanced Innovation Center for Integrated Circuits, Beijing, China