DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts

📅 2025-06-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address expert redundancy and insufficient diversity in Mixture-of-Experts (MoE) model reconstruction, this paper proposes a calibration-data-driven three-stage reconstruction paradigm: (1) leveraging multi-domain calibration data to quantify domain affinity and measure expert diversity; (2) performing structured pruning and expert recombination at the FFN-module level; and (3) applying parameter-efficient fine-tuning to routers and normalization layers for modular, lightweight retraining. This work is the first to explicitly use calibration data to guide expert differentiation, eliminating reliance on heuristic design. Evaluated on Llama-series models, the method achieves negligible accuracy degradation (<0.3%) under identical activated-parameter budgets while reducing training overhead by 42–68%, significantly outperforming existing pruning and reconstruction approaches.
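The first stage can be pictured as scoring FFN neurons on calibration data from several domains and checking how much the resulting top-k neuron sets overlap. The sketch below is a minimal illustration of that idea, not the paper's implementation: the toy SwiGLU block, the random "calibration" batches, and the mean-absolute-activation importance score are all assumptions made for the example.

```python
# Minimal sketch of the domain-affinity-mining idea (not the paper's code):
# score the FFN neurons of a Llama-style SwiGLU block on calibration batches
# from several domains, then compare the top-k neuron sets across domains.
# Low overlap suggests the pruned sub-networks (future experts) will be diverse.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, intermediate, top_k = 64, 256, 64  # toy dimensions, assumed for illustration

class SwiGLU(nn.Module):
    """Llama-style FFN: down(silu(gate(x)) * up(x))."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        h = F.silu(self.gate(x)) * self.up(x)   # per-neuron activations
        return self.down(h), h

ffn = SwiGLU()

def neuron_importance(calib_batch):
    """Mean absolute activation per intermediate neuron on one domain's data."""
    with torch.no_grad():
        _, h = ffn(calib_batch)
    return h.abs().mean(dim=(0, 1))             # shape: (intermediate,)

# Random tensors standing in for multi-domain calibration corpora.
domains = {name: torch.randn(4, 128, hidden) for name in ("code", "math", "web")}
top_sets = {name: set(neuron_importance(x).topk(top_k).indices.tolist())
            for name, x in domains.items()}

# Pairwise Jaccard overlap of retained neurons: lower overlap = higher diversity.
names = list(top_sets)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = top_sets[names[i]], top_sets[names[j]]
        print(f"{names[i]} vs {names[j]}: overlap {len(a & b) / len(a | b):.2f}")
```

In the actual method, per-domain neuron sets found in this spirit would seed the pruned experts that the later reconstruction and retraining stages reassemble.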

📝 Abstract
Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, training extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a given LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts, and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.
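As a rough picture of the reconstruction and retraining stages described in the abstract, the sketch below slices a dense SwiGLU FFN into several smaller experts using per-domain neuron index sets, adds a router, and leaves only the router, experts, and normalization layer trainable. It is an illustrative toy under those assumptions, not the released DIVE implementation; the naive routing loop, the layer sizes, and the LayerNorm stand-in for Llama's RMSNorm are choices made for the example.

```python
# Illustrative reconstruction of a dense FFN into MoE experts (toy, not DIVE's code):
# each expert keeps only the FFN neurons selected for one domain, a linear router
# dispatches tokens, and only router / experts / normalization stay trainable.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden, intermediate, expert_dim, n_experts = 64, 256, 64, 3  # toy sizes

class DenseFFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)

def make_expert(dense, idx):
    """Build one expert by slicing the dense FFN's weights with a neuron index set."""
    e = nn.ModuleDict({
        "gate": nn.Linear(hidden, expert_dim, bias=False),
        "up":   nn.Linear(hidden, expert_dim, bias=False),
        "down": nn.Linear(expert_dim, hidden, bias=False),
    })
    with torch.no_grad():
        e["gate"].weight.copy_(dense.gate.weight[idx])      # rows of gate projection
        e["up"].weight.copy_(dense.up.weight[idx])          # rows of up projection
        e["down"].weight.copy_(dense.down.weight[:, idx])   # columns of down projection
    return e

dense = DenseFFN()
# Stand-ins for the per-domain neuron sets found during domain affinity mining.
expert_indices = [torch.randperm(intermediate)[:expert_dim] for _ in range(n_experts)]
experts = nn.ModuleList(make_expert(dense, idx) for idx in expert_indices)
router = nn.Linear(hidden, n_experts, bias=False)
norm = nn.LayerNorm(hidden)  # stand-in for Llama's RMSNorm

def moe_forward(x, active=1):
    """Route each token to its top `active` experts (naive, non-batched dispatch)."""
    weights, picks = router(norm(x)).softmax(-1).topk(active, dim=-1)
    out = torch.zeros_like(x)
    for k in range(active):
        for e_id, expert in enumerate(experts):
            mask = (picks[..., k] == e_id).unsqueeze(-1)
            h = F.silu(expert["gate"](x)) * expert["up"](x)
            out = out + mask * weights[..., k:k + 1] * expert["down"](h)
    return out

# Retraining phase: only routers, experts and normalization modules are updated;
# in the full model the attention and embedding weights would stay frozen.
for p in dense.parameters():
    p.requires_grad_(False)
trainable = sum(p.numel() for m in (router, experts, norm) for p in m.parameters())
print(moe_forward(torch.randn(2, 8, hidden)).shape, f"{trainable} trainable params")
```

Slicing the existing dense weights rather than initializing experts from scratch is what keeps the retraining budget small: in this toy block, only the router, experts, and norm contribute trainable parameters.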
Problem

Research questions and friction points this paper is trying to address.

Enhancing expert diversity in MoE LLM reconstruction
Reducing training overhead in dense-to-MoE conversion
Improving accuracy-efficiency trade-offs in expert activation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity-Enhanced MoE reconstruction method
Pruning-based expert reconstruction technique
Efficient retraining of routers and experts
👥 Authors
Yuchen Feng
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Bowen Shen
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Naibin Gu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Jiaxuan Zhao
Xidian University
Peng Fu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Zheng Lin
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Weiping Wang
School of Information Science and Engineering, Central South University
Computer Network · Network Security