🤖 AI Summary
This work addresses the inefficiency of existing single-GPU heterogeneous inference systems in handling "warm experts" in Mixture-of-Experts (MoE) models. GPU-CPU systems that offload cold experts are bottlenecked by host memory bandwidth, while DIMM-based near-data processing (DIMM-NDP) lacks sufficient compute capability for warm experts, so resources are utilized unevenly. To bridge this gap, the authors propose TriMoE, a GPU-CPU-NDP architecture that integrates AMX-accelerated CPU cores with DIMM-NDP. TriMoE combines fine-grained expert classification, bottleneck-aware scheduling, and prediction-driven dynamic re-layout to map hot, warm, and cold experts to their optimal execution units. This fills the compute gap between GPU-CPU and GPU-NDP designs and achieves up to 2.83× higher inference throughput across diverse MoE models than state-of-the-art heterogeneous inference solutions.
📝 Abstract
To deploy large Mixture-of-Experts (MoE) models cost-effectively, offloading-based single-GPU heterogeneous inference is crucial. While GPU-CPU architectures that offload cold experts are constrained by host memory bandwidth, emerging GPU-NDP architectures utilize DIMM-NDP to offload non-hot experts. However, non-hot experts are not a homogeneous memory-bound group: there is a significant subset of warm experts that is severely penalized by high GPU I/O latency yet can saturate NDP compute throughput, exposing a critical compute gap. We present TriMoE, a novel GPU-CPU-NDP architecture that fills this gap by leveraging AMX-enabled CPUs to precisely map hot, warm, and cold experts onto their optimal compute units. We further introduce a bottleneck-aware expert scheduling policy and a prediction-driven dynamic relayout/rebalancing scheme. Experiments demonstrate that TriMoE achieves up to 2.83× speedup over state-of-the-art solutions.
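The three-way mapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `classify_experts`, the use of per-batch token counts as the classification signal, and both threshold values are hypothetical, and the assignment of warm experts to the AMX CPU and cold experts to DIMM-NDP is inferred from the abstract rather than stated as an algorithm.

```python
from collections import Counter

# Hypothetical thresholds (tokens routed to an expert in one batch).
# These values are illustrative assumptions, not figures from the paper.
HOT_MIN_TOKENS = 32
WARM_MIN_TOKENS = 4


def classify_experts(routed_expert_ids, num_experts,
                     hot_min=HOT_MIN_TOKENS, warm_min=WARM_MIN_TOKENS):
    """Assign each expert to a compute unit based on how many tokens it receives.

    routed_expert_ids: iterable of expert indices, one entry per routed token.
    Returns a dict mapping expert index -> "gpu", "cpu_amx", or "ndp".
    """
    counts = Counter(routed_expert_ids)
    placement = {}
    for expert in range(num_experts):
        tokens = counts.get(expert, 0)
        if tokens >= hot_min:
            placement[expert] = "gpu"      # hot: compute-heavy, keep on the GPU
        elif tokens >= warm_min:
            placement[expert] = "cpu_amx"  # warm: assumed to go to AMX CPU cores
        else:
            placement[expert] = "ndp"      # cold: memory-bound, assumed to go to DIMM-NDP
    return placement


if __name__ == "__main__":
    # 40 tokens for expert 0, 5 for expert 1, 1 for expert 2, none for expert 3.
    routing = [0] * 40 + [1] * 5 + [2]
    print(classify_experts(routing, num_experts=4))
```

A static threshold is the simplest possible policy. The bottleneck-aware scheduling and prediction-driven relayout that the abstract mentions would presumably adjust these boundaries at run time, which this sketch does not attempt.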