🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models face a fundamental trade-off among quality, memory footprint, and inference latency when deployed on edge devices. This paper proposes Compact Sparse MoE (CoSMoE), a co-designed solution addressing these challenges through three key innovations: (1) a weight-decomposed expert architecture that drastically reduces parameter count; (2) a FLOP-aligned fair evaluation framework, enabling the first empirical demonstration that small-scale MoE outperforms dense baselines under identical computational budgets; and (3) an integrated strategy combining model sharding with dynamic offloading to enhance real-time edge inference. Experiments show that CoSMoE maintains or improves model quality while reducing memory consumption by 42% and decreasing edge-side latency by 3.1×. These results establish CoSMoE as a high-quality, low-overhead paradigm for on-device AI inference on resource-constrained platforms.
📝 Abstract
Sparse Mixture-of-Experts (MoE) models are popular foundational architectures at large scale but remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: Quality, Memory and Latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors) MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
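To give a rough sense of why decomposing expert weights shrinks the parameter count, the sketch below computes the savings from a generic low-rank factorization. This is an illustrative approximation, not the paper's exact construction: the function names, dimensions, and rank are all assumptions chosen for the example.

```python
# Hypothetical sketch (not the paper's exact method): factor each expert's
# weight matrix W (d_in x d_out) into A (d_in x r) @ B (r x d_out), so the
# per-expert parameter count drops from d_in*d_out to r*(d_in + d_out).

def dense_expert_params(d_in: int, d_out: int) -> int:
    """Parameters of a single dense expert weight matrix."""
    return d_in * d_out

def decomposed_expert_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of a rank-r factorization W ~= A @ B."""
    return rank * (d_in + d_out)

# Example dimensions for an on-device-scale FFN expert (illustrative only).
d_in, d_out, rank = 1024, 4096, 128
dense = dense_expert_params(d_in, d_out)                # 4,194,304
compact = decomposed_expert_params(d_in, d_out, rank)   # 655,360
print(f"parameter reduction: {1 - compact / dense:.1%}")
```

At these example sizes the factorized expert uses well under a fifth of the dense expert's parameters; the real trade-off is that too small a rank can hurt quality, which is why the choice of decomposition matters.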