TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

πŸ“… 2026-03-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the inefficiency of existing single-GPU heterogeneous inference systems in handling "warm experts" in Mixture-of-Experts (MoE) models: cold experts offloaded to the host are bottlenecked by memory bandwidth, while DIMM-based near-data processing (DIMM-NDP) lacks sufficient compute capability for warm experts, leading to imbalanced resource utilization. To bridge this gap, the authors propose TriMoE, a novel architecture that integrates AMX-accelerated CPU cores with DIMM-NDP. TriMoE enables fine-grained expert classification, bottleneck-aware scheduling, and prediction-driven dynamic re-layout to precisely map hot, warm, and cold experts to their optimal execution units. This approach fills the computational void between the GPU–CPU and GPU–NDP subsystems, achieving up to 2.83Γ— higher inference throughput across diverse MoE models compared to state-of-the-art heterogeneous inference solutions.
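The three-tier mapping the summary describes can be sketched as a toy placement policy: rank experts by how often the router activates them, split the ranking into hot/warm/cold tiers, and assign each tier to the execution unit TriMoE targets (GPU, AMX-enabled CPU, DIMM-NDP). The tier fractions, names, and the frequency-only criterion below are illustrative assumptions for exposition, not the paper's actual classification or scheduling algorithm.

```python
from collections import Counter

# Hypothetical device mapping reflecting the paper's three-tier idea:
# hot experts stay on the GPU, warm experts run on AMX-enabled CPU
# cores, and cold (rarely activated, memory-bound) experts run on
# DIMM-NDP. The dict keys/values are placeholders, not real APIs.
TIER_TO_DEVICE = {"hot": "gpu", "warm": "amx_cpu", "cold": "dimm_ndp"}

def classify_experts(activation_counts, hot_frac=0.2, warm_frac=0.4):
    """Rank experts by activation frequency: the top hot_frac are
    'hot', the next warm_frac are 'warm', the rest are 'cold'.
    Fractions are illustrative, not taken from the paper."""
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    n = len(ranked)
    n_hot = max(1, int(n * hot_frac))
    n_warm = max(1, int(n * warm_frac))
    placement = {}
    for i, expert in enumerate(ranked):
        if i < n_hot:
            tier = "hot"
        elif i < n_hot + n_warm:
            tier = "warm"
        else:
            tier = "cold"
        placement[expert] = TIER_TO_DEVICE[tier]
    return placement

# Toy routing trace: one expert id per token routing decision.
trace = ["e0"] * 50 + ["e1"] * 20 + ["e2"] * 5 + ["e3"] * 2 + ["e4"]
counts = Counter(trace)
placement = classify_experts(counts)
```

In this sketch the skewed trace puts `e0` on the GPU, `e1` and `e2` on the AMX CPU, and the rarely used `e3`/`e4` on DIMM-NDP; the paper's actual system additionally re-layouts experts dynamically as activation patterns shift, which a static split like this cannot capture.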

πŸ“ Abstract
To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging an AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
heterogeneous inference
expert offloading
compute gap
warm experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
Heterogeneous Inference
Near-Data Processing
AMX
Expert Offloading
Yudong Pan
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yintao He
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Tianhua Han
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Lian Liu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Shixin Zhao
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Zhirong Chen
Master's student, Institute of Computing Technology, Chinese Academy of Sciences
Computer Architecture; Machine Learning
Mengdi Wang
Institute of Computing Technology, Chinese Academy of Sciences
Accelerator architecture design; multi-core system
Cangyuan Li
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yinhe Han
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Ying Wang
Institute of Computing Technology, Chinese Academy of Sciences
Reliable Computer Architecture; VLSI design; Machine learning; Memory system