A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses three key challenges in MoE model inference on edge GPU–NDP systems: imbalanced NDP workload, underutilized GPU resources, and high pre-analysis overhead caused by unpredictable expert activation. To tackle these issues, the paper introduces tensor parallelism for the first time in low-batch edge scenarios, integrating it with a load-aware scheduling strategy and a pre-analysis-free expert prefetching mechanism to form a cooperative GPU–NDP computing architecture. The proposed approach significantly enhances resource utilization and inference efficiency, achieving an average end-to-end latency speedup of 2.41× and up to 2.56× over existing methods, demonstrating clear superiority in edge deployment settings.
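The tensor-parallel idea summarized above, splitting each large expert's weights across multiple NDP units so they compute one expert cooperatively, can be sketched in plain Python. This is an illustrative Megatron-style partition (the up-projection split column-wise, the down-projection row-wise, partial outputs reduced by summation), not the paper's implementation; all names and shapes are made up.

```python
def matmul(a, b):
    """List-of-lists dense matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]

def tp_expert(x, w_up, w_down, units=2):
    """Hypothetical tensor-parallel expert FFN: w_up is split column-wise
    and w_down row-wise across `units` NDP units; each unit computes a
    partial output, and the partials are reduced by summation."""
    w = len(w_up[0]) // units          # columns of w_up per unit
    h = len(w_down) // units           # rows of w_down per unit
    out = [[0.0] * len(w_down[0]) for _ in x]
    for i in range(units):
        u = [row[i * w:(i + 1) * w] for row in w_up]   # this unit's w_up shard
        d = w_down[i * h:(i + 1) * h]                  # this unit's w_down shard
        part = matmul(relu(matmul(x, u)), d)           # local compute on one unit
        out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, part)]
    return out

# Tiny example: one token, hidden size 2, FFN size 4, two NDP units.
x = [[1.0, -2.0]]
w_up = [[1.0, 0.0, 2.0, -1.0],
        [0.0, 1.0, -1.0, 3.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
parallel = tp_expert(x, w_up, w_down, units=2)
reference = matmul(relu(matmul(x, w_up)), w_down)  # single-device result
```

Because the activation is elementwise, each unit's column shard of the hidden state lines up with its row shard of the down-projection, so the summed partials match the single-device result exactly.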

📝 Abstract
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection under expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for prefetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition large expert parameters across multiple NDP units and compute them simultaneously, suiting low-batch edge scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across the NDP units and the GPU to maximize resource utilization. Third, a dataset-free prefetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve a 2.41x average and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
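The load-balancing-aware scheduling described in the abstract can be illustrated with a generic greedy heuristic: assign the heaviest activated experts first, always to the least-loaded device among the NDP units and the GPU. This longest-processing-time rule and the `gpu_speedup` parameter are stand-in assumptions for illustration, not the paper's actual algorithm.

```python
import heapq

def schedule_experts(expert_costs, ndp_units=4, gpu_speedup=2.0):
    """Toy load-balancing-aware scheduler: greedily place each activated
    expert (heaviest first) on the currently least-loaded device.
    `gpu_speedup` (how much faster the GPU runs one expert than an NDP
    unit) is an assumed parameter."""
    heap = [(0.0, f"ndp{i}") for i in range(ndp_units)] + [(0.0, "gpu")]
    heapq.heapify(heap)                      # min-heap keyed by current load
    plan = {}
    for eid, cost in sorted(expert_costs.items(), key=lambda kv: -kv[1]):
        load, dev = heapq.heappop(heap)      # least-loaded device
        runtime = cost / gpu_speedup if dev == "gpu" else cost
        plan[eid] = dev
        heapq.heappush(heap, (load + runtime, dev))
    makespan = max(load for load, _ in heap)
    return plan, makespan

# Six activated experts with estimated costs, two NDP units plus the GPU.
plan, makespan = schedule_experts({0: 4, 1: 4, 2: 2, 3: 2, 4: 1, 5: 1},
                                  ndp_units=2, gpu_speedup=2.0)
```

Routing some experts to the otherwise idle GPU shortens the critical path, which is the resource-utilization gain the abstract's second optimization targets.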
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
Edge Computing
Near-Data Processing
Load Imbalance
GPU Utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Near-Data Processing
Tensor Parallelism
Load-Balancing Scheduling
Dataset-Free Prefetching
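The dataset-free prefetching tag above can be made concrete with a small sketch: count expert activations at runtime and keep the hottest experts resident in fast memory, so no offline dataset profiling is needed. The class interface and the top-k replacement policy are assumptions for illustration, not the paper's design.

```python
from collections import Counter

class ExpertPrefetcher:
    """Sketch of dataset-free expert prefetching: activation frequencies
    are counted online, and the most frequently used experts are kept
    resident, avoiding any offline pre-profiling pass. Hypothetical
    interface, not the paper's."""

    def __init__(self, capacity):
        self.capacity = capacity   # number of experts that fit in fast memory
        self.counts = Counter()
        self.resident = set()

    def on_activation(self, expert_id):
        """Record an activation; return True if the expert was already resident."""
        self.counts[expert_id] += 1
        hit = expert_id in self.resident
        # Keep the current top-k most frequently activated experts resident.
        self.resident = {e for e, _ in self.counts.most_common(self.capacity)}
        return hit

pf = ExpertPrefetcher(capacity=2)
trace = [3, 1, 3, 1, 3]            # runtime expert activations
hits = sum(pf.on_activation(e) for e in trace)
```

After the first activation of each hot expert, subsequent activations hit the resident set, which is how a frequency-driven policy can hide expert-loading latency without a profiling dataset.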
Qi Wu
School of Electronic Science and Engineering, Nanjing University, China
Chao Fang
Shanghai Qi Zhi Institute
efficient ML, AI accelerator, hardware-software co-design, precision-scalable computing, RISC-V
Jiayuan Chen
China Mobile Research Institute, China
Ye Lin
School of Electronic Science and Engineering, Nanjing University, China
Yueqi Zhang
Beijing Institute of Technology
NLP, LLM
Yichuan Bai
School of Electronic Science and Engineering, Nanjing University, China
Yuan Du
School of Electronic Science and Engineering, Nanjing University, China
Li Du
School of Electronic Science and Engineering, Nanjing University, China