Collaborative Compression for Large-Scale MoE Deployment on Edge

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the storage and memory bottlenecks that hinder deployment of ultra-large Mixture-of-Experts (MoE) models, such as the 671B-parameter DeepSeek-V3, on resource-constrained edge devices, this work proposes a holistic compression framework that jointly optimizes expert pruning, mixed-precision quantization, and activation optimization, the first such integrated approach. By moving beyond conventional single-paradigm compression, it mitigates the accuracy and output-quality degradation that occurs under high compression ratios. Experiments demonstrate a reduction in model storage footprint from 1.3 TB to 103 GB, enabling successful deployment on an edge platform with a strict 128 GB memory limit. Moreover, compared to uniform low-bit quantization, the method achieves higher benchmark accuracy at smaller model sizes. This synergy significantly enhances both the practicality and the energy efficiency of MoE models on edge devices.

📝 Abstract
The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs): it increases model capacity while keeping computation cost low. However, ultra-large MoE models still have hundreds of billions of parameters, requiring massive memory and storage and making deployment on resource-constrained edge platforms difficult. Pruning or quantization alone can hardly address the issue, because the super-aggressive compression ratio required significantly degrades accuracy and output quality. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework that combines expert pruning, mixed-precision quantization, and activation optimization. It effectively reduces the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3 TB to 103 GB, while preserving high output quality with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model derived from the ultra-large DeepSeek-V3 on a platform with a strict 128 GB total memory limit. Our comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method, with smaller model sizes and higher accuracy than uniform low-bit quantization methods.
Problem

Research questions and friction points this paper is trying to address.

Reducing massive memory requirements for large MoE models on edge devices
Overcoming accuracy degradation from aggressive compression methods alone
Enabling deployment of ultra-large models under strict memory constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines expert pruning with mixed-precision quantization
Integrates activation optimization for enhanced compression
Reduces storage footprint while preserving output quality
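The first two levers can be sketched together in a toy form. The Python below is an illustrative sketch only, not the authors' actual algorithm: the importance scores, keep ratio, and bit-width assignment rule are all hypothetical stand-ins. It prunes the lowest-importance experts, then quantizes the survivors at mixed precision (more bits for more important experts).

```python
# Illustrative sketch of collaborative compression for an MoE layer.
# ASSUMPTIONS (not from the paper): importance = e.g. routing frequency,
# keep_ratio = 0.5, and the top half of kept experts get the higher bit-width.

def quantize(weights, bits):
    """Symmetric uniform quantization of a flat weight list to `bits` bits.

    Returns (quantized integer codes, scale) so the expert can be
    dequantized later with `dequantize`.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid scale == 0
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Reconstruct approximate float weights from integer codes."""
    return [c * scale for c in codes]


def compress_experts(experts, importance, keep_ratio=0.5, high_bits=8, low_bits=4):
    """Prune low-importance experts, then quantize the rest at mixed precision.

    experts    : list of flat weight lists, one per expert
    importance : one score per expert (higher = more important; hypothetical proxy)
    Returns {expert_index: (codes, scale, bits)} for the kept experts only.
    """
    # Expert pruning: keep only the top keep_ratio fraction by importance.
    order = sorted(range(len(experts)), key=lambda i: -importance[i])
    kept = order[:max(1, int(len(experts) * keep_ratio))]

    # Mixed-precision quantization: higher-ranked kept experts get more bits.
    compressed = {}
    for rank, i in enumerate(kept):
        bits = high_bits if rank < len(kept) // 2 else low_bits
        codes, scale = quantize(experts[i], bits)
        compressed[i] = (codes, scale, bits)
    return compressed
```

With four experts and `keep_ratio=0.5`, two experts are dropped outright, one survivor is stored at 8 bits and one at 4, and each kept expert can be reconstructed to within half a quantization step of its original weights. The real framework additionally applies activation optimization, which this sketch omits.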
Authors
Yixiao Chen, Northeastern University
Yanyue Xie, ByteDance Seed
Ruining Yang, PhD Student, Northeastern University
Wei Jiang, Futurewei Technologies, Inc.
Wei Wang, Futurewei Technologies, Inc.
Yong He, Futurewei Technologies, Inc.
Yue Chen, Futurewei Technologies, Inc.
Pu Zhao, Northeastern University
Yanzhi Wang, Northeastern University