🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models face a fundamental trade-off among quality, memory footprint, and inference latency when deployed on edge devices. This paper proposes Compact Sparse MoE (CoSMoE), a co-designed solution addressing these challenges through three key innovations: (1) a weight-decomposed expert architecture that drastically reduces parameter count; (2) a FLOP-aligned fair evaluation framework, enabling the first empirical demonstration that small-scale MoE outperforms dense baselines under identical computational budgets; and (3) an integrated strategy combining model sharding with dynamic offloading to enhance real-time edge inference. Experiments show that CoSMoE maintains or improves model quality while reducing memory consumption by 42% and decreasing edge-side latency by 3.1×. These results establish CoSMoE as a high-quality, low-overhead paradigm for on-device AI inference on resource-constrained platforms.
📝 Abstract
Sparse Mixture-of-Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors) MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
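To give a rough sense of why decomposing expert weights shrinks the parameter count, the sketch below computes the savings from a generic low-rank factorization. This is an illustrative approximation, not the paper's exact construction: the function names, dimensions, and rank are all assumptions chosen for the example.

```python
# Hypothetical sketch (not the paper's exact method): factor each expert's
# weight matrix W (d_in x d_out) into A (d_in x r) @ B (r x d_out), so the
# per-expert parameter count drops from d_in*d_out to r*(d_in + d_out).

def dense_expert_params(d_in: int, d_out: int) -> int:
    """Parameters of a single dense expert weight matrix."""
    return d_in * d_out

def decomposed_expert_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a rank-r factorization W ~= A @ B."""
    return rank * (d_in + d_out)

# Example dimensions for an on-device-scale FFN expert (illustrative only).
d_in, d_out, rank = 1024, 4096, 128
dense = dense_expert_params(d_in, d_out)                # 4,194,304
compact = decomposed_expert_params(d_in, d_out, rank)   # 655,360
print(f"parameter reduction: {1 - compact / dense:.1%}")
```

At these example sizes the factorized expert uses well under a fifth of the dense expert's parameters; the real trade-off is that too small a rank can hurt quality, which is why the choice of decomposition matters.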