MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high data transfer overhead of inactive experts during MoE inference under GPU memory constraints, this paper proposes an error-bounded lossy compression-based expert offloading mechanism, leveraging SZ3 and CuSZp to efficiently compress and transfer inactive expert parameters between CPU and GPU. Key findings reveal a previously unobserved layer-wise sensitivity disparity among MoE experts: shallow and deep experts exhibit robustness to compression errors—sometimes even yielding slight accuracy gains—whereas middle-layer experts are highly sensitive, with even small errors causing significant performance degradation. Experiments demonstrate that the proposed method substantially reduces inter-device data transfer overhead while maintaining or improving overall inference accuracy. This work establishes a new paradigm for efficient MoE deployment in resource-constrained environments.

📝 Abstract
With the widespread adoption of Mixture of Experts (MoE) reasoning models for LLM serving, efficiently serving MoE models under limited GPU memory constraints has emerged as a significant challenge. Offloading non-activated experts to main memory has been identified as an efficient approach to this problem, but it introduces the challenge of transferring experts between GPU memory and main memory. This motivates the need for an efficient way to compress experts, together with an analysis of how the compression error affects inference performance. To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy. The results indicate that experts in the shallow layers, which are primarily responsible for the attention mechanism and the transformation of input tokens into vector representations, exhibit minimal degradation in inference accuracy when subjected to bounded errors. In contrast, errors in the middle-layer experts, which are central to model reasoning, significantly impair inference accuracy. Interestingly, introducing bounded errors in the deep-layer experts, which are mainly responsible for instruction following and output integration, can sometimes lead to improvements in inference accuracy.
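The "error-bounded" guarantee at the heart of compressors like SZ3 can be illustrated with a minimal sketch. The snippet below is a simplified uniform-quantization stand-in, not SZ3's actual prediction-plus-entropy-coding pipeline; the function names, the mock weight shape, and the chosen error bound are all illustrative.

```python
import numpy as np

def compress_expert(weights: np.ndarray, error_bound: float) -> np.ndarray:
    """Quantize a weight tensor so every element's reconstruction error
    stays within the absolute error bound. Each quantization bin spans
    2 * error_bound, so rounding to the nearest bin center guarantees
    |w - w_hat| <= error_bound (the integer codes would then be entropy
    coded in a real SZ-style compressor)."""
    return np.round(weights / (2.0 * error_bound)).astype(np.int32)

def decompress_expert(codes: np.ndarray, error_bound: float) -> np.ndarray:
    """Map integer codes back to bin-center values."""
    return codes.astype(np.float64) * (2.0 * error_bound)

# Example: a mock expert weight matrix (shape chosen for illustration).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256))

eb = 1e-3  # absolute error bound
codes = compress_expert(w, eb)
w_hat = decompress_expert(codes, eb)

# The pointwise reconstruction error never exceeds the bound.
assert np.max(np.abs(w - w_hat)) <= eb
```

The paper's layer-wise findings then amount to choosing `eb` per layer: shallow and deep experts tolerate a loose bound (cheaper transfers), while middle-layer experts need a tight bound or no lossy compression at all.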
Problem

Research questions and friction points this paper is trying to address.

Compressing non-activated experts to reduce GPU memory transfer overhead
Analyzing how compression errors in different expert layers affect inference accuracy
Identifying which expert layers are most sensitive to compression-induced errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Error-bounded compression for non-activated experts
Analyzing layer-specific compression error impacts
Reducing data transfer overhead in MoE inference
Songkai Ma
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Zhaorui Zhang
The Hong Kong Polytechnic University, Department of Computing
LLM and MLSys, HPC, Distributed & Parallel Systems, Cloud Computing, FPGA
Sheng Di
Argonne National Laboratory, IEEE Senior Member
HPC, Data Compression, Resilience, Cloud/Grid Computing/P2P, Federated Learning
Benben Liu
LSCM R&D Center, The University of Hong Kong, Hong Kong
Xiaodong Yu
Department of Computer Science, Stevens Institute of Technology, USA
Xiaoyi Lu
Associate Professor, University of California, Merced
Big Data, High Performance Computing, Cloud Computing, Deep Learning, Distributed Computing
Dan Wang
Department of Computing, Hong Kong Polytechnic University, Hong Kong