TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading

📅 2026-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of existing single-GPU heterogeneous inference systems in handling "warm experts" in Mixture-of-Experts (MoE) models. GPU-CPU systems that offload cold experts are bottlenecked by host memory bandwidth, while DIMM-based near-data processing (DIMM-NDP) lacks sufficient compute capability for warm experts, leading to imbalanced resource utilization. To bridge this gap, the authors propose TriMoE, a GPU-CPU-NDP architecture that integrates AMX-accelerated CPU cores with DIMM-NDP. TriMoE combines fine-grained expert classification, bottleneck-aware scheduling, and prediction-driven dynamic re-layout to map hot, warm, and cold experts to their optimal execution units. This fills the compute gap between GPU-CPU and GPU-NDP systems and achieves up to 2.83× higher inference throughput than state-of-the-art heterogeneous inference solutions across diverse MoE models.

📝 Abstract

To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging AMX-enabled CPU to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83× speedup over state-of-the-art solutions.
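
The hot/warm/cold mapping described in the abstract can be illustrated with a minimal sketch. It is not the paper's policy: the token-count thresholds, the function name `classify_experts`, the device labels, and the assignment of hot experts to the GPU, warm experts to the AMX-enabled CPU, and cold experts to DIMM-NDP are all assumptions made for illustration, and the sketch ignores the bottleneck-aware scheduling and prediction-driven relayout that the paper adds on top.

```python
from collections import Counter

# Hypothetical thresholds on tokens routed to an expert in one batch.
# These values are illustrative assumptions, not taken from the paper.
HOT_MIN_TOKENS = 64
WARM_MIN_TOKENS = 8


def classify_experts(routed_expert_ids, num_experts,
                     hot_min=HOT_MIN_TOKENS, warm_min=WARM_MIN_TOKENS):
    """Map each expert id to an assumed execution unit by its token load."""
    counts = Counter(routed_expert_ids)
    placement = {}
    for expert in range(num_experts):
        tokens = counts.get(expert, 0)
        if tokens >= hot_min:
            placement[expert] = "gpu"        # hot: compute-heavy
        elif tokens >= warm_min:
            placement[expert] = "cpu_amx"    # warm: assumed to suit AMX cores
        else:
            placement[expert] = "dimm_ndp"   # cold: memory-bound
    return placement


# Usage: expert 0 receives 100 tokens, expert 1 receives 10, expert 2 receives 2.
routed = [0] * 100 + [1] * 10 + [2] * 2
placement = classify_experts(routed, num_experts=4)
```

A static threshold is the simplest possible classifier; a real system would presumably derive the boundaries from measured device throughput and bandwidth rather than fixed constants.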

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

heterogeneous inference

expert offloading

compute gap

warm experts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Heterogeneous Inference

Near-Data Processing