🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models face a fundamental cost-accuracy-performance (CAP) trade-off on heterogeneous hardware, yet no systematic benchmark exists to jointly quantify these interdependent dimensions. Method: This paper proposes the first three-dimensional benchmarking framework for sparse MoE, built upon a sparse-aware CAP co-analysis model that integrates hardware-aware modeling, dynamic sparsity activation profiling, multi-dimensional metric normalization, and system-level simulation—enabling unified quantification and single-plot visualization of CAP coupling. Contribution/Results: It is the first to quantitatively characterize how sparsity impacts end-to-end system performance across mainstream MoE architectures. Experiments show a 37% reduction in CAP estimation error versus state-of-the-art baselines, delivering a reproducible, interpretable, and scalable quantitative foundation for model-hardware co-design.
📝 Abstract
The Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs). Its key feature, sparse activation, selectively activates only a subset of parameters (experts) per token, reducing memory bandwidth and compute FLOPs compared to dense models. To capitalize on this, MoE designers leverage heterogeneous compute and memory hardware to lower system costs. However, the interaction between model sparsity and hardware heterogeneity introduces trade-offs in Cost, Accuracy, and Performance (CAP). To address this, we introduce MoE-CAP, a benchmarking method for evaluating sparse MoE systems across these three dimensions. Its key innovation is a sparsity-aware CAP analysis model, the first to integrate cost, performance, and accuracy metrics into a single diagram while estimating the impact of sparsity on system performance. MoE-CAP helps practitioners optimize hardware provisioning for an MoE model-or vice versa. MoE-CAP supports various MoE models and provides more accurate metrics than existing methods.