LightMoE: Reducing Mixture-of-Experts Redundancy through Expert Replacing

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high memory overhead of large Mixture-of-Experts (MoE) models, which stems from the need to load numerous expert modules, as well as the limitations of existing compression methods, which often incur irreversible performance degradation or substantial training costs. The authors propose a novel expert-replacement paradigm in which redundant experts are substituted with parameter-efficient modules whose capabilities are then restored through low-cost training, forming the LightMoE framework. LightMoE integrates adaptive expert selection, a hierarchical expert architecture, and an annealing-based recovery strategy to compress the model substantially without the performance drops typical of conventional pruning or merging approaches. Experiments show that at a 30% compression ratio LightMoE matches the performance of LoRA fine-tuning, and at 50% compression it outperforms current state-of-the-art methods with an average improvement of 5.6% across five tasks.
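The paper does not spell out the exact form of the parameter-efficient replacement modules, but a low-rank factorization is one natural instantiation of the idea. The sketch below (all names, `DenseExpert`, `LowRankExpert`, `compression`, are illustrative, not from the paper) counts parameters to show why swapping a dense expert for such a module shrinks memory:

```python
class DenseExpert:
    """A standard MoE expert, idealized here as one d x d weight matrix (bias omitted)."""
    def __init__(self, d):
        self.d = d

    def num_params(self):
        return self.d * self.d


class LowRankExpert:
    """Illustrative parameter-efficient replacement: W ~ B @ A with
    B in R^{d x r}, A in R^{r x d}, and rank r much smaller than d."""
    def __init__(self, d, r):
        self.d, self.r = d, r

    def num_params(self):
        return 2 * self.d * self.r


def compression(d, r):
    """Return (dense params, replacement params, size ratio) for one expert."""
    dense = DenseExpert(d).num_params()
    light = LowRankExpert(d, r).num_params()
    return dense, light, light / dense


print(compression(1024, 64))  # (1048576, 131072, 0.125)
```

With d = 1024 and rank 64, each replaced expert keeps only 12.5% of its original parameters, which is how replacing a subset of experts can yield the 30-50% whole-model compression ratios reported above.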

📝 Abstract
Mixture-of-Experts (MoE) based Large Language Models (LLMs) have demonstrated impressive performance and computational efficiency. However, their deployment is often constrained by substantial memory demands, primarily due to the need to load numerous expert modules. While existing expert compression techniques like pruning or merging attempt to mitigate this, they often suffer from irreversible knowledge loss or high training overhead. In this paper, we propose a novel expert compression paradigm termed expert replacing, which replaces redundant experts with parameter-efficient modules and recovers their capabilities with low training costs. We find that even a straightforward baseline of this paradigm yields promising performance. Building on this foundation, we introduce LightMoE, a framework that enhances the paradigm by introducing adaptive expert selection, hierarchical expert construction, and an annealed recovery strategy. Experimental results show that LightMoE matches the performance of LoRA fine-tuning at a 30% compression ratio. Even under a more aggressive 50% compression rate, it outperforms existing methods and achieves average performance improvements of 5.6% across five diverse tasks. These findings demonstrate that LightMoE strikes a superior balance among memory efficiency, training efficiency, and model performance.
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
memory efficiency
expert compression
large language models
model deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

expert replacing
LightMoE
Mixture-of-Experts
parameter-efficient compression
adaptive expert selection