HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor adaptability of static parallelism strategies in Mixture-of-Experts (MoE) model inference, this paper proposes HAP—a dynamic hybrid parallelism optimization framework. HAP innovatively decouples the MoE architecture into independent computational modules (e.g., attention and experts), constructs a lightweight modular latency simulation model, and automatically identifies the optimal parallel configuration—including tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP)—via integer linear programming (ILP). Evaluated across diverse hardware (A100, A6000, V100) and MoE models (Mixtral, Qwen), HAP achieves 1.68×, 1.77×, and 1.57× inference speedup over state-of-the-art TP-only baselines, respectively. By enabling hardware- and model-agnostic deployment, HAP significantly improves both inference efficiency and generalizability of MoE models.

📝 Abstract
Current inference systems for Mixture-of-Experts (MoE) models primarily employ static parallelization strategies. However, these static approaches cannot consistently achieve optimal performance across different inference scenarios, as they lack the flexibility to adapt to varying computational requirements. In this work, we propose HAP (Hybrid Adaptive Parallelism), a novel method that dynamically selects hybrid parallel strategies to enhance MoE inference efficiency. The fundamental innovation of HAP lies in hierarchically decomposing MoE architectures into two distinct computational modules: the Attention module and the Expert module, each augmented with a specialized inference latency simulation model. This decomposition enables the construction of a comprehensive search space of model parallel strategies. By leveraging Integer Linear Programming (ILP), HAP can solve for the optimal hybrid parallel configuration that maximizes inference efficiency under varying computational constraints. Our experiments demonstrate that HAP consistently determines parallel configurations that achieve comparable or superior performance to the TP strategy prevalent in mainstream inference systems. Compared to TP-based inference, HAP-based inference achieves speedups of 1.68x, 1.77x, and 1.57x on A100, A6000, and V100 GPU platforms, respectively. Furthermore, HAP showcases remarkable generalization capability, maintaining performance effectiveness across diverse MoE model configurations, including the Mixtral and Qwen series models.
Problem

Research questions and friction points this paper is trying to address.

Dynamic hybrid parallel strategies for efficient MoE inference
Adaptive optimization under varying computational constraints
Hierarchical decomposition of MoE architectures for parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid adaptive parallelism for MoE inference
Hierarchical decomposition with latency simulation
Integer Linear Programming optimizes parallel configurations
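The core idea described above can be sketched in miniature: decompose the model into Attention and Expert modules, attach a simple latency model to each, and search the space of (TP, DP, EP) degrees for the configuration with the lowest simulated latency. The sketch below is illustrative only; it uses an exhaustive search as a stand-in for the paper's ILP formulation, and all cost constants and function names (`attention_latency`, `expert_latency`, `search_optimal_config`) are hypothetical placeholders, not HAP's calibrated simulation model.

```python
# Illustrative sketch of HAP's search space, with a brute-force search
# standing in for the ILP solver. Cost constants are made up.
from itertools import product

NUM_GPUS = 8

def attention_latency(tp, dp):
    # Toy cost model: compute shrinks with TP; TP communication grows.
    compute = 100.0 / tp
    comm = 5.0 * (tp - 1)
    return compute + comm

def expert_latency(ep, tp):
    # Toy cost model: expert compute shrinks with EP (and TP within
    # each expert group); all-to-all dispatch cost grows with EP.
    compute = 200.0 / (ep * tp)
    all_to_all = 8.0 * (ep - 1)
    return compute + all_to_all

def search_optimal_config(num_gpus=NUM_GPUS):
    """Enumerate valid (TP, EP) degrees; DP fills the remaining GPUs."""
    best = None
    degrees = [d for d in (1, 2, 4, 8) if num_gpus % d == 0]
    for tp, ep in product(degrees, degrees):
        if tp * ep > num_gpus:
            continue  # configuration needs more GPUs than available
        dp = num_gpus // (tp * ep)
        latency = attention_latency(tp, dp) + expert_latency(ep, tp)
        if best is None or latency < best[0]:
            best = (latency, {"tp": tp, "dp": dp, "ep": ep})
    return best

latency, config = search_optimal_config()
```

In the real system this per-module decomposition is what keeps the ILP tractable: each module's latency is modeled independently, so the solver only needs the per-module cost terms rather than a profile of every end-to-end configuration.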
Haoran Lin
School of Software, Shandong University, Jinan, China
Xianzhi Yu
Unknown affiliation
AI, HPC
Kang Zhao
Huawei Noah's Ark Lab, Beijing, China
Han Bao
Huawei Noah's Ark Lab, Beijing, China
Zongyuan Zhan
Huawei Noah's Ark Lab, Beijing, China
Ting Hu
Associate Professor, School of Computing, Queen's University, Canada
Explainable AI, Evolutionary Computing, Machine Learning, Bioinformatics
Wulong Liu
Unknown affiliation
Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Zekun Yin
School of Software, Shandong University, Jinan, China
Xin Li
School of Software, Shandong University, Jinan, China
Weiguo Liu
Shandong University
High Performance Computing, Big Data