Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling

📅 2026-03-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses low-batch Mixture-of-Experts (MoE) inference on edge devices, which is hampered by limited on-chip memory, load imbalance, and frequent off-chip memory accesses. To overcome these limitations, the authors propose a Fully Sharded Expert Data Parallelism (FSE-DP) paradigm tailored for multi-chiplet accelerators. The approach couples a novel dynamic expert trajectory scheduling mechanism with hardware-friendly virtualization rules, enabling fine-grained runtime scheduling of expert streams over high-bandwidth chiplet interconnects. The design overlaps computation with communication while keeping the workload balanced across chiplets. Evaluated on multi-chiplet architectures, the proposed method achieves a 1.22–2.00× speedup over the state-of-the-art baseline and reduces on-chip memory usage by up to 78.8%.
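The computation-communication overlap described above can be pictured as double-buffered expert streaming: while one expert's FFN runs, the next expert's weights are pulled over the die-to-die link. The sketch below is an illustrative assumption rather than the paper's actual dataflow; fetch_expert and compute_expert are hypothetical stand-ins for the weight-transfer and expert-compute primitives.

```python
# Minimal double-buffering sketch: prefetch the next expert's weights over the
# D2D link while the current expert computes. fetch_expert and compute_expert
# are stand-in functions, not the paper's actual primitives.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_expert(expert_id):
    time.sleep(0.05)                      # simulate a D2D weight transfer
    return f"weights[{expert_id}]"

def compute_expert(expert_id, weights, tokens):
    time.sleep(0.08)                      # simulate expert FFN compute
    return f"out(e{expert_id}, {len(tokens)} tok)"

def run_expert_stream(expert_ids, tokens):
    """Process experts in order, overlapping each transfer with prior compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch_expert, expert_ids[0])
        for i, e in enumerate(expert_ids):
            weights = pending.result()            # wait for this expert's weights
            if i + 1 < len(expert_ids):           # start the next transfer early
                pending = io.submit(fetch_expert, expert_ids[i + 1])
            results.append(compute_expert(e, weights, tokens[e]))
    return results

if __name__ == "__main__":
    print(run_expert_stream([4, 7, 1], {4: [0, 1], 7: [2], 1: [3, 4, 5]}))
```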
📝 Abstract
Mixture-of-Experts (MoE) is a promising approach for edge AI with low-batch inference. Yet on-device deployments often face limited on-chip memory and severe workload imbalance, and the prevalent use of offloading further incurs off-chip memory access bottlenecks. Moreover, MoE sparsity and dynamic gating push distributed strategies toward much finer granularity and introduce runtime scheduling considerations. Recently, high-bandwidth die-to-die (D2D) chiplet interconnects have created new opportunities for multi-chiplet systems to address workload imbalance and offloading bottlenecks through fine-grained scheduling. In this paper, we propose Fully Sharded Expert Data Parallelism (FSE-DP), a parallelization paradigm specifically architected for low-batch MoE inference on multi-chiplet accelerators. FSE-DP attains adaptive computation-communication overlap and balanced load by orchestrating fine-grained, complementary expert streams along dynamic trajectories across high-bandwidth D2D links. The attendant dataflow complexity is tamed by a minimal, hardware-amenable set of virtualization rules and a lightweight scheduling algorithm. Our approach achieves a 1.22–2.00× speedup over state-of-the-art baselines and saves up to 78.8% of on-chip memory.
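The "lightweight scheduling algorithm" mentioned in the abstract is not detailed on this page. As a minimal sketch only, assuming a simple cost model and a greedy least-finish-time policy (the Chiplet class, compute_cost, and d2d_cost are illustrative assumptions, not FSE-DP's actual virtualization rules or trajectory scheduler), expert streams could be placed across chiplets like this:

```python
# Hypothetical greedy expert-to-chiplet scheduler; the cost model and names
# below are illustrative assumptions, not the paper's FSE-DP algorithm.
from dataclasses import dataclass

@dataclass
class Chiplet:
    cid: int
    resident_experts: set        # experts whose weight shards already live here
    busy_until: float = 0.0      # estimated time at which queued work drains

def schedule_expert_streams(activated_experts, tokens_per_expert, chiplets,
                            compute_cost=1.0, d2d_cost=0.3):
    """Greedy placement: send each activated expert to the chiplet that finishes
    it earliest, charging an extra per-token D2D streaming cost when the
    expert's shards must be pulled over the die-to-die link."""
    plan = []
    for e in activated_experts:
        n_tok = tokens_per_expert[e]
        best_finish, best_chiplet = None, None
        for c in chiplets:
            transfer = 0.0 if e in c.resident_experts else d2d_cost * n_tok
            finish = c.busy_until + compute_cost * n_tok + transfer
            if best_finish is None or finish < best_finish:
                best_finish, best_chiplet = finish, c
        best_chiplet.busy_until = best_finish
        plan.append((e, best_chiplet.cid, best_finish))
    return plan

if __name__ == "__main__":
    chiplets = [Chiplet(0, {0, 1}), Chiplet(1, {2, 3})]
    # Skewed gating: expert 0 gets most of the tokens, stressing load balance.
    plan = schedule_expert_streams([0, 2, 3], {0: 8, 2: 2, 3: 2}, chiplets)
    for expert, cid, finish in plan:
        print(f"expert {expert} -> chiplet {cid} (est. finish {finish:.1f})")
```

In the actual system such decisions would be made online at each gating step and constrained by the hardware-amenable virtualization rules, but the greedy placement above conveys the basic trade-off between load balance and D2D transfer cost.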
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
low-batch inference
workload imbalance
off-chip memory bottleneck
fine-grained scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
multi-chiplet architecture
dynamic expert scheduling
Fully Sharded Expert Data Parallelism
low-batch inference
🔎 Similar Papers
No similar papers found.
Songchen Ma
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China
Hongyi Li
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Weihao Zhang
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China
Yonghao Tan
The Hong Kong University of Science and Technology
AI Accelerator · Computer Vision · VLSI
Pingcheng Dong
Hong Kong University of Science and Technology
AI Chip · Model Compression · HW/SW Co-Design
Yu Liu
Assistant Professor, Department of Computing, Hong Kong Polytechnic University
Edge AI · Distributed Quantum Computing
Lan Liu
Shanghai UniVista Industrial Software Group Co., Ltd., Shanghai, China
Yuzhong Jiao
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China
Xuejiao Liu
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China
Luhong Liang
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China
Kwang-Ting Cheng
AI Chip Center for Emerging Smart Systems, Hong Kong SAR, China