Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models

📅 2025-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the excessive memory and communication overhead that arises as traditional Mixture-of-Experts (MoE) models scale their expert count in large language models, this paper proposes the Mixture of Latent Experts (MoLE). MoLE maps experts into a shared low-dimensional latent space, enabling a factorized decomposition of expert weights. Its core contributions are: (i) the introduction of latent-space expert sharing, (ii) theoretical conditions guaranteeing that a pretrained MoE can be converted to MoLE, and (iii) a two-stage structural reparameterization algorithm for that conversion. Experiments show that MoLE preserves representational capacity and language-modeling performance while substantially reducing parameter count (by up to 72%), GPU memory consumption (by up to 68%), and inter-expert communication overhead. Both training and inference efficiency are thereby significantly improved, offering a practical and scalable pathway for deploying large-scale MoE models.

📝 Abstract
Mixture of Experts (MoE) has emerged as a pivotal architectural paradigm for efficient scaling of Large Language Models (LLMs), operating through selective activation of parameter subsets for each input token. Nevertheless, conventional MoE architectures encounter substantial challenges, including excessive memory utilization and communication overhead during training and inference, primarily attributable to the proliferation of expert modules. In this paper, we introduce Mixture of Latent Experts (MoLE), a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space. Specifically, all expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations with significantly reduced parametric complexity. This factorized approach substantially diminishes parameter count and computational requirements. Beyond the pretraining implementation of the MoLE architecture, we also establish a rigorous mathematical framework for transforming pre-trained MoE models into the MoLE architecture, characterizing the sufficient conditions for optimal factorization and developing a systematic two-phase algorithm for this conversion process. Our comprehensive theoretical analysis demonstrates that MoLE significantly enhances computational efficiency across multiple dimensions while preserving model representational capacity. Empirical evaluations corroborate our theoretical findings, confirming that MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.
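The factorized decomposition described in the abstract can be sketched as follows. This is a minimal illustrative example with made-up dimensions and variable names, not the paper's released code: every expert shares one projection into a small latent space, then applies its own low-cost transformation there.

```python
import numpy as np

# Illustrative sketch of a MoLE-style factorized expert layer; dimensions
# and names are assumptions, not taken from the paper.
rng = np.random.default_rng(0)
d_model, d_latent, d_ff, n_experts = 512, 64, 1024, 8  # d_latent << d_model

# Shared projection into the low-dimensional latent space (one matrix for all experts).
W_shared = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)

# Expert-specific transformations, each operating only on the small latent space.
W_experts = [rng.standard_normal((d_latent, d_ff)) / np.sqrt(d_latent)
             for _ in range(n_experts)]

def mole_expert(x, e):
    """Apply expert e: shared latent projection, then expert-specific map."""
    z = x @ W_shared         # shared: (batch, d_model) -> (batch, d_latent)
    return z @ W_experts[e]  # expert-specific: (batch, d_latent) -> (batch, d_ff)

x = rng.standard_normal((4, d_model))  # a batch of 4 token representations
y = mole_expert(x, 3)
print(y.shape)  # (4, 1024)
```

Because the expensive `d_model`-sized projection is shared, adding more experts only adds small `d_latent × d_ff` matrices, which is where the parameter and communication savings come from.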
Problem

Research questions and friction points this paper is trying to address.

Reduces memory and communication overhead in MoE models
Introduces latent space to lower expert parameter complexity
Converts pre-trained MoE to MoLE without losing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

MoLE maps experts into shared latent space
Decomposes expert operations into two components
Reduces parameters and computational requirements
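The parameter savings from the two-component decomposition can be checked with back-of-envelope arithmetic. The dimensions below are illustrative, not the paper's configurations, so the resulting percentage is only indicative of the mechanism, not of the reported 72% figure.

```python
# Hypothetical per-layer parameter count; dimensions are illustrative.
d_model, d_latent, d_ff, n_experts = 512, 64, 1024, 8

# Standard MoE: every expert stores its own full d_model x d_ff matrix.
moe_params = n_experts * d_model * d_ff

# MoLE: one shared d_model x d_latent projection, plus a small
# d_latent x d_ff transform per expert.
mole_params = d_model * d_latent + n_experts * d_latent * d_ff

print(moe_params, mole_params)  # 4194304 557056
print(f"reduction: {1 - mole_params / moe_params:.0%}")  # reduction: 87%
```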
Zehua Liu
Huawei Noah’s Ark Lab
Han Wu
Huawei Noah’s Ark Lab
Ruifeng She
Huawei Noah’s Ark Lab
Xiaojin Fu
Huawei Noah’s Ark Lab
Xiongwei Han
AI&OR Principal Researcher at Noah's Ark Lab, Huawei
Intelligence Modeling · LLMs for OR
Tao Zhong
Huawei Noah’s Ark Lab
Mingxuan Yuan
Huawei Noah’s Ark Lab