Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
MoE model inference frequently incurs high-overhead external weight offloading and reloading due to expert weights exceeding GPU memory capacity. To address this memory wall, we propose a CXL-interconnect-enabled near-data processing (NDP) co-inference framework. Our method leverages activation statistics gathered during the prefill phase to perform context-aware, dynamic hot/cold expert partitioning; cold experts are executed in situ on the CXL-NDP, reducing data movement from high-bandwidth parameter transfers to low-bandwidth activation transfers. We further integrate expert-granular 1–4-bit mixed-precision quantization, HBM-resident caching of hot experts, and GPU–NDP compute–communication overlap. Evaluated on a CXL-enabled GPU–NDP system, our approach achieves up to 8.7× higher decoding throughput while incurring only a 0.13% average accuracy degradation—significantly alleviating the memory bottleneck in MoE inference.
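The dynamic hot/cold partitioning described above can be sketched as follows. This is a minimal illustration under the assumption that per-expert activation counts are collected during the prefill phase; the function name, `hbm_slots` parameter, and counts are hypothetical, not the paper's API:

```python
def partition_experts(prefill_counts, hbm_slots):
    """Rank experts by prefill-phase activation frequency: the hottest
    experts are pinned in GPU HBM, the rest run in place on CXL-NDP."""
    # Sort expert ids by activation count, most active first.
    order = sorted(range(len(prefill_counts)),
                   key=lambda e: prefill_counts[e], reverse=True)
    hot = set(order[:hbm_slots])    # cached in GPU-side HBM
    cold = set(order[hbm_slots:])   # executed in situ on the CXL-NDP
    return hot, cold

# Hypothetical prefill statistics for 8 experts; HBM holds 3 of them.
counts = [120, 5, 300, 40, 7, 210, 15, 60]
hot, cold = partition_experts(counts, hbm_slots=3)
# hot -> {0, 2, 5}
```

Only the cold experts' activations then cross the CXL link each decoding step, rather than their full weight matrices.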

📝 Abstract
Mixture-of-Experts (MoE) models scale large language models through conditional computation, but inference becomes memory-bound once expert weights exceed GPU memory capacity. In that case, weights must be offloaded to external memory, and fetching them back incurs costly, repeated transfers. We address this by adopting CXL-attached near-data processing (CXL-NDP) as the offloading tier and executing cold experts in place, converting expensive parameter movement into cheaper activation movement. Unlike prior GPU-NDP systems, which are largely context-agnostic and reactive, we develop a context-aware MoE system that uses prefill-stage activation statistics to guide decoding-stage expert placement: it dynamically pins hot experts in GPU-side HBM and maps the remainder to CXL-NDP. To meet the NDP's limited compute throughput, we introduce context-aware mixed-precision quantization that allocates per-expert bitwidths (1–4 bit) based on prefill-stage statistics. The resulting MoE inference system overlaps GPU and NDP execution while minimizing cross-device data movement. Evaluation on the GPU-NDP system shows that our approach achieves up to an 8.7× decoding throughput improvement over the state-of-the-art method, while incurring only a 0.13% average accuracy drop.
Problem

Research questions and friction points this paper is trying to address.

MoE inference becomes memory-bound when expert weights exceed GPU memory capacity
Offloaded expert weights must be repeatedly fetched back to the GPU, incurring costly high-bandwidth parameter transfers
Can cold experts be executed on CXL-attached near-data processing so that only cheaper activation traffic crosses the interconnect?
Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL-attached near-data processing for cold expert execution
Context-aware expert placement using prefill activation statistics
Per-expert mixed-precision quantization (1–4 bit) guided by prefill-stage statistics
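The per-expert bitwidth allocation in the last bullet could look like the toy policy below: experts that fire on a larger share of prefill tokens keep more bits. The thresholds and function name are illustrative assumptions, not the paper's actual allocator:

```python
def allocate_bitwidths(prefill_counts, thresholds=(0.5, 0.2, 0.05)):
    """Assign each expert a bitwidth in {4, 3, 2, 1} by its share of
    prefill activations. Thresholds are illustrative, not from the paper."""
    total = sum(prefill_counts)
    bits = []
    for count in prefill_counts:
        share = count / total
        if share >= thresholds[0]:
            bits.append(4)   # hottest experts keep the most precision
        elif share >= thresholds[1]:
            bits.append(3)
        elif share >= thresholds[2]:
            bits.append(2)
        else:
            bits.append(1)   # rarely used experts are quantized hardest
    return bits

# Hypothetical prefill counts for 4 experts.
print(allocate_bitwidths([500, 200, 60, 10]))  # -> [4, 3, 2, 1]
```

Lower bitwidths for cold experts shrink the weight footprint and the compute per token on the NDP side, which is what makes in-situ execution feasible given the NDP's limited throughput.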
Zehao Fan
Rensselaer Polytechnic Institute, Troy, NY, USA
Zhenyu Liu
Rensselaer Polytechnic Institute, Troy, NY, USA
Yunzhen Liu
University of Massachusetts Amherst, Amherst, MA, USA
Yayue Hou
Rensselaer Polytechnic Institute, Troy, NY, USA
Hadjer Benmeziane
IBM Research
Kaoutar El Maghraoui
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
Liu Liu
Rensselaer Polytechnic Institute, Troy, NY, USA