AI Summary
This work addresses the inefficiency of existing single-GPU heterogeneous inference systems in handling "warm experts" in Mixture-of-Experts (MoE) models: cold experts are bottlenecked by host memory bandwidth, while DIMM-based near-data processing (DIMM-NDP) lacks the compute capability needed for warm experts, leading to imbalanced resource utilization. To bridge this gap, the authors propose TriMoE, a novel architecture that integrates AMX-accelerated CPU cores with DIMM-NDP. TriMoE combines fine-grained expert classification, bottleneck-aware scheduling, and prediction-driven dynamic re-layout to map hot, warm, and cold experts onto their optimal execution units. This approach fills the computational void between the GPU-CPU and GPU-NDP subsystems, achieving up to 2.83× higher inference throughput across diverse MoE models compared to state-of-the-art heterogeneous inference solutions.
Abstract
To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures use DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: a significant subset of warm experts exists that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by synergistically leveraging AMX-enabled CPU cores to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83× speedup over state-of-the-art solutions.
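The three-way expert placement described above can be sketched as follows. This is a hypothetical illustration under assumed frequency-based thresholds (`hot_frac`, `warm_frac` are made-up parameters), not the paper's actual classifier, which additionally uses bottleneck-aware scheduling and prediction-driven dynamic re-layout:

```python
# Illustrative sketch of three-way expert placement: rank experts by
# activation frequency, then map hot -> GPU, warm -> AMX CPU cores,
# cold -> DIMM-NDP. Thresholds and structures here are assumptions.

def classify_experts(activation_counts, hot_frac=0.1, warm_frac=0.3):
    """Split experts into hot/warm/cold tiers by activation frequency.

    activation_counts: dict mapping expert_id -> observed activations.
    hot_frac / warm_frac: hypothetical fractions of experts kept on the
    GPU and on AMX CPU cores, respectively; the remainder go to DIMM-NDP.
    """
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    n = len(ranked)
    n_hot = max(1, int(n * hot_frac))
    n_warm = max(1, int(n * warm_frac))
    placement = {}
    for i, expert in enumerate(ranked):
        if i < n_hot:
            placement[expert] = "gpu"        # hot: keep resident in GPU memory
        elif i < n_hot + n_warm:
            placement[expert] = "cpu_amx"    # warm: compute-heavy, avoid GPU I/O
        else:
            placement[expert] = "dimm_ndp"   # cold: memory-bound, run near data
    return placement

# Example: 10 experts, e0 most frequently activated.
counts = {f"e{i}": 100 - i for i in range(10)}
print(classify_experts(counts))
```

In this sketch the tier boundaries are static; the paper's prediction-driven re-layout would instead move experts between tiers as activation patterns shift across requests.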