CoSMoEs: Compact Sparse Mixture of Experts

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models face a fundamental trade-off among quality, memory footprint, and inference latency when deployed on edge devices. This paper proposes Compact Sparse MoE (CoSMoEs), a co-designed solution addressing these challenges through three key innovations: (1) a weight-decomposed expert architecture that drastically reduces parameter count; (2) a FLOP-aligned fair evaluation framework, enabling the first empirical demonstration that small-scale MoE outperforms dense baselines under identical computational budgets; and (3) an integrated strategy combining model sharding with dynamic offloading to enhance real-time edge inference. Experiments show that CoSMoEs maintain or improve model quality while reducing memory consumption by 42% and decreasing edge-side latency by 3.1×. These results establish CoSMoEs as a high-quality, low-overhead paradigm for on-device AI inference on resource-constrained platforms.
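The summary does not spell out the exact factorization behind "weight-decomposed experts," but the general idea of shrinking an expert's parameter count via low-rank weight decomposition can be sketched as follows. All names (`make_expert`, `expert_forward`, the rank-`r` factorization of each FFN matrix) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def make_expert(d_model, d_ff, rank, rng):
    """One hypothetical weight-decomposed expert: each full FFN matrix
    W (d_model x d_ff) is replaced by a low-rank product A @ B, shrinking
    its parameters from d_model * d_ff to rank * (d_model + d_ff)."""
    s = 0.02  # small init scale
    return {
        "A_in":  rng.standard_normal((d_model, rank)) * s,
        "B_in":  rng.standard_normal((rank, d_ff)) * s,
        "A_out": rng.standard_normal((d_ff, rank)) * s,
        "B_out": rng.standard_normal((rank, d_model)) * s,
    }

def expert_forward(e, x):
    # Factored up-projection + ReLU, then factored down-projection.
    h = np.maximum(x @ e["A_in"] @ e["B_in"], 0.0)
    return h @ e["A_out"] @ e["B_out"]

def param_counts(d_model, d_ff, rank):
    dense = 2 * d_model * d_ff                 # two full FFN matrices
    decomposed = 2 * rank * (d_model + d_ff)   # their low-rank replacements
    return dense, decomposed
```

For example, with `d_model=512`, `d_ff=2048`, and `rank=64`, the decomposed expert uses 327,680 parameters instead of 2,097,152, while keeping the same input/output shapes.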

📝 Abstract
Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale; however, they remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixtures of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: quality, memory, and latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors) MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
Problem

Research questions and friction points this paper is trying to address.

Enable Compact Sparse Mixture of Experts for on-device inference
Improve model quality, memory, and latency for on-device use
Enhance offloading efficiency and reduce inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact Sparse Mixture of Experts for on-device inference
Weight-decomposed experts enhance model performance
Improved offloading efficiency reduces inference latency
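The summary does not describe the offloading mechanism in detail, but a common pattern for reducing memory and latency in on-device MoE serving is to keep only a small working set of experts resident in device memory and load the rest on demand. A minimal sketch, with an assumed LRU policy and hypothetical names (`ExpertCache`, `load_fn`) that are not from the paper:

```python
from collections import OrderedDict

class ExpertCache:
    """Illustrative dynamic-offloading cache: at most `capacity` experts stay
    resident in device memory; others are fetched from host storage on demand,
    evicting the least recently used resident expert."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn        # fetches expert weights from host/flash
        self.resident = OrderedDict() # expert_id -> weights, in LRU order
        self.loads = 0                # number of host-to-device transfers

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # cache hit: mark recent
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[expert_id] = self.load_fn(expert_id)
            self.loads += 1
        return self.resident[expert_id]
```

Because sparse routing activates only a few experts per token, such a cache can serve most requests from device memory; fewer host-to-device transfers on the critical path is one plausible way offloading efficiency translates into lower inference latency.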