🤖 AI Summary
To address expert redundancy and insufficient diversity in Mixture-of-Experts (MoE) model reconstruction, this paper proposes a calibration-data-driven three-stage reconstruction paradigm: (1) leveraging multi-domain calibration data to quantify domain affinity and measure expert diversity; (2) performing structured pruning and expert recombination at the FFN-module level; and (3) applying parameter-efficient fine-tuning to routers and normalization layers for modular, lightweight retraining. This work is the first to explicitly use calibration data to guide expert differentiation, eliminating reliance on heuristic design. Evaluated on Llama-series models, our method achieves negligible accuracy degradation (<0.3%) under identical activated parameter budgets, while reducing training overhead by 42–68%, significantly outperforming existing pruning and reconstruction approaches.
📝 Abstract
Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of their parameters. Despite the inference efficiency of MoE LLMs, training extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a single LLM yields notably diverse subnetworks when pruned on different calibration datasets. Based on this observation, we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction involves pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain only the routers, experts, and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters.
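The core idea above — that pruning the same dense FFN under different calibration domains yields structurally different subnetworks, which can then be reassembled as diverse experts — can be sketched as follows. This is a minimal illustrative toy, not DIVE's actual implementation: the importance scores here are random stand-ins for whatever domain-specific saliency criterion the paper uses, and `prune_ffn` is a hypothetical helper.

```python
import numpy as np

def prune_ffn(w_in, w_out, importance, keep):
    """Structured pruning of an FFN's hidden neurons: keep the top-`keep`
    neurons ranked by a domain-specific importance score (the scoring
    criterion here is a placeholder, not DIVE's actual one)."""
    idx = np.argsort(importance)[-keep:]
    return w_in[:, idx], w_out[idx, :]

rng = np.random.default_rng(0)
d_model, d_hidden, keep = 8, 32, 8
w_in = rng.normal(size=(d_model, d_hidden))   # dense FFN up-projection
w_out = rng.normal(size=(d_hidden, d_model))  # dense FFN down-projection

# One importance profile per calibration domain -> one pruned expert per
# domain. Different domains rank neurons differently, so experts diverge.
domain_importance = [rng.random(d_hidden) for _ in range(4)]
experts = [prune_ffn(w_in, w_out, imp, keep) for imp in domain_importance]

# Each expert retains a (generally) different subset of hidden neurons,
# which is the source of the diversity the method exploits.
kept_sets = [frozenset(np.argsort(imp)[-keep:].tolist())
             for imp in domain_importance]
print(f"{len(set(kept_sets))} distinct neuron subsets across 4 experts")
```

In the full method these pruned experts would be reassembled into an MoE layer behind a trainable router, after which only the routers, experts, and normalization modules are retrained.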