🤖 AI Summary
To address expert redundancy and insufficient diversity in Mixture-of-Experts (MoE) model reconstruction, this paper proposes a calibration-data-driven three-stage reconstruction paradigm: (1) leveraging multi-domain calibration data to quantify domain affinity and measure expert diversity; (2) performing structured pruning and expert recombination at the FFN-module level; and (3) applying parameter-efficient fine-tuning to routers and normalization layers for modular, lightweight retraining. This work is the first to explicitly use calibration data to guide expert differentiation, eliminating reliance on heuristic design. Evaluated on Llama-series models, our method achieves negligible accuracy degradation (<0.3%) under identical activated parameter budgets, while reducing training overhead by 42–68%, significantly outperforming existing pruning and reconstruction approaches.
📝 Abstract
Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of their parameters. Despite the inference efficiency of MoE LLMs, training extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a single LLM yields notably diverse subnetworks when pruned on different calibration datasets. Based on this observation, we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction involves pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain only the routers, experts, and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.
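The core idea above — that pruning the same dense FFN under different calibration domains yields structurally different subnetworks, which can then be reassembled as diverse experts — can be sketched as follows. This is a minimal illustrative toy, not DIVE's actual implementation: the importance scores here are random stand-ins for whatever domain-specific saliency criterion the paper uses, and `prune_ffn` is a hypothetical helper.

```python
import numpy as np

def prune_ffn(w_in, w_out, importance, keep):
    """Structured pruning of an FFN's hidden neurons: keep the top-`keep`
    neurons ranked by a domain-specific importance score (the scoring
    criterion here is a placeholder, not DIVE's actual one)."""
    idx = np.argsort(importance)[-keep:]
    return w_in[:, idx], w_out[idx, :]

rng = np.random.default_rng(0)
d_model, d_hidden, keep = 8, 32, 8
w_in = rng.normal(size=(d_model, d_hidden))   # dense FFN up-projection
w_out = rng.normal(size=(d_hidden, d_model))  # dense FFN down-projection

# One importance profile per calibration domain -> one pruned expert per
# domain. Different domains rank neurons differently, so experts diverge.
domain_importance = [rng.random(d_hidden) for _ in range(4)]
experts = [prune_ffn(w_in, w_out, imp, keep) for imp in domain_importance]

# Each expert retains a (generally) different subset of hidden neurons,
# which is the source of the diversity the method exploits.
kept_sets = [frozenset(np.argsort(imp)[-keep:].tolist())
             for imp in domain_importance]
print(f"{len(set(kept_sets))} distinct neuron subsets across 4 experts")
```

In the full method these pruned experts would be reassembled into an MoE layer behind a trainable router, after which only the routers, experts, and normalization modules are retrained.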