Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

📅 2025-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low parallel efficiency and poor Tensor Core utilization of irregular sparse computations such as Mixture-of-Experts (MoE) on GPUs, this paper proposes an execution paradigm that combines static batching with dynamic task mapping. At compile time, it constructs a dense task graph, transforming dynamic sparse inference into a single-kernel execution; a lightweight runtime scheduler then maps fine-grained tasks onto hardware resources. This enables highly efficient, targeted Tensor Core computation for MoE inference, attaining 91% and 95% of peak Tensor Core throughput on NVIDIA H800 and H20 GPUs, respectively, and significantly outperforming existing dynamic batching methods. The core contribution is a compiler–runtime co-optimization framework that establishes a new paradigm for high-throughput deployment of sparse models on hardware accelerators.

📝 Abstract
It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on NVIDIA H800 GPU and 95% on NVIDIA H20 GPU.
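The two-phase idea in the abstract, statically batching irregular per-expert workloads into a dense task list, then mapping those tasks to hardware at runtime, can be sketched in plain Python. This is an illustrative toy, not the paper's CUDA implementation: the tile size, the round-robin mapping, and the function names are assumptions for demonstration only.

```python
# Hypothetical sketch of static batching + runtime task mapping
# (NOT the paper's actual kernel; names and policies are illustrative).

TILE = 4  # tokens per tile; a fixed tile shape is what makes the batch "static"

def build_task_graph(tokens_per_expert):
    """'Compile-time' step: pad each expert's irregular token count up to a
    multiple of TILE and emit one task per tile, yielding a dense, static
    task list that a single kernel could iterate over."""
    tasks = []
    for expert, n in enumerate(tokens_per_expert):
        n_tiles = (n + TILE - 1) // TILE  # ceiling division
        tasks.extend((expert, t) for t in range(n_tiles))
    return tasks

def map_tasks(tasks, num_workers):
    """'Runtime' step: a lightweight scheduler assigns tile tasks to a fixed
    pool of workers (standing in for SMs). Round-robin here; the paper's
    mapping mechanism is more sophisticated."""
    schedule = [[] for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        schedule[i % num_workers].append(task)
    return schedule

tasks = build_task_graph([5, 0, 9, 2])      # irregular expert loads
schedule = map_tasks(tasks, num_workers=3)  # 6 tiles spread over 3 workers
```

The key property the sketch preserves is that the task list's shape depends only on the (padded) workload sizes, so all experts' tiles can run inside one kernel launch rather than one launch per expert.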
Problem

Research questions and friction points this paper is trying to address.

GPU Efficiency
Irregular Tasks
Mixture-of-Experts Model
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU Efficiency
Mixture-of-Experts Optimization
CUDA Programming
Yinghan Li
Alibaba Group
Yifei Li
Alibaba Group
Jiejing Zhang
Alibaba Group
Bujiao Chen
Alibaba Group
Xiaotong Chen
Alibaba Group
Lian Duan
Associate Professor of Information Systems, Hofstra University
Data Mining, Machine Learning
Yejun Jin
Alibaba Group
Zheng Li
Alibaba Group
Xuanyu Liu
Alibaba Group
Haoyu Wang
Alibaba Group
Wente Wang
Alibaba Group
Yajie Wang
Beijing Institute of Technology
Jiacheng Yang
Nanjing University
🧠 Large Multimodal Models, 💪 Reinforcement Learning, 🥽 Visual Reasoning
Peiyang Zhang
Alibaba Group
Laiwen Zheng
Alibaba Group
Wenyuan Yu
Alibaba Group
Graph computation, data management, distributed systems and parallel computation