DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models

📅 2025-03-03

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Existing upcycled Mixture-of-Experts (MoE) models suffer from severe parameter redundancy after converting dense feed-forward layers into expert structures, leading to suboptimal training and inference efficiency. This work first identifies the intrinsic redundancy mechanism within upsampled MoE experts and proposes the Decompose–Replace–Synthesize (DeRS) paradigm. DeRS decouples each expert into a shared base weight matrix and a lightweight delta representation, enabling efficient expert upsampling during training and extreme model compression at inference time. The method integrates low-rank/sparse modeling, weight quantization, and structured pruning. Evaluated across three diverse task categories, DeRS achieves up to 90% parameter compression, 40% reduction in training memory footprint, and 35% lower inference latency—without any performance degradation.

Technology Category

Application Category

📝 Abstract

Upcycled Mixture-of-Experts (MoE) models have shown great potential in various tasks by converting the original Feed-Forward Network (FFN) layers in pre-trained dense models into MoE layers. However, these models still suffer from significant parameter inefficiency due to the introduction of multiple experts. In this work, we propose a novel DeRS (Decompose, Replace, and Synthesis) paradigm to overcome this shortcoming, which is motivated by our observations about the unique redundancy mechanisms of upcycled MoE experts. Specifically, DeRS decomposes the experts into one expert-shared base weight and multiple expert-specific delta weights, and subsequently represents these delta weights in lightweight forms. Our proposed DeRS paradigm can be applied to enhance parameter efficiency in two different scenarios, including: 1) DeRS Compression for inference stage, using sparsification or quantization to compress vanilla upcycled MoE models; and 2) DeRS Upcycling for training stage, employing lightweight sparse or low-rank matrixes to efficiently upcycle dense models into MoE models. Extensive experiments across three different tasks show that the proposed methods can achieve extreme parameter efficiency while maintaining the performance for both training and compression of upcycled MoE models.

Problem

Research questions and friction points this paper is trying to address.

Enhances parameter efficiency in upcycled Mixture-of-Experts models.

Addresses redundancy in experts by decomposing and representing weights.

Improves training and compression efficiency for MoE models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose experts into shared base and delta weights

Use lightweight forms for expert-specific delta weights

Apply DeRS for compression and upcycling scenarios

🔎 Similar Papers

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts