🤖 AI Summary
This work addresses the challenge of deploying mixture-of-experts (MoE) models on edge devices, where excessive memory consumption makes lossless inference impractical. To overcome this, the authors propose a synergistic mechanism that integrates cache-affinity-aware scheduling with lossless compression, exploiting both the hardware characteristics of edge devices and the statistical redundancy in MoE parameters. This approach shifts the inference bottleneck from I/O-bound to compute-bound while preserving exact model behavior. Notably, it is the first method to significantly enhance MoE inference efficiency on edge devices under strict lossless constraints, and it offers provable performance guarantees. Experimental results on representative edge platforms demonstrate up to a 72.77% reduction in inference latency and up to a 6.76× increase in throughput.
📝 Abstract
While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without resorting to lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent in MoE parameters via a caching-scheduling co-design with provable performance guarantees. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation shows that ZipMoE achieves up to $72.77\%$ lower inference latency and up to $6.76\times$ higher throughput than state-of-the-art systems.