Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 5
Influential: 1
🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models critically depend on hyperparameters such as the total number of experts and the top-k activation count, making manual tuning computationally expensive. To address this, we propose Dynamic Mixture-of-Experts (DynMoE), the first framework enabling token-level adaptive top-k gating and dynamic expert addition/removal during training—eliminating reliance on fixed hyperparameters. Methodologically, DynMoE introduces a gradient-aware dynamic gating network and an expert pruning mechanism, jointly optimizing sparse activation, load balancing, and parameter efficiency. Evaluated across vision, language, and vision-language multimodal tasks, DynMoE matches the performance of GMoE and MoE-LLaVA while significantly reducing activated parameter count and improving both training and inference efficiency.

📝 Abstract
The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), resulting in significant computational overhead due to the extensive model training required to search over various hyper-parameter configurations. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate the effectiveness of our approach in achieving competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.
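The token-adaptive gating idea from the abstract can be illustrated with a small sketch: instead of a fixed top-k, each token activates every expert whose gate score clears a threshold, so the effective k varies per token. The sigmoid scoring, per-expert thresholds, and best-expert fallback below are illustrative assumptions, not the paper's exact formulation (see the linked repository for the actual implementation).

```python
import numpy as np

def adaptive_topk_gate(x, gate_weight, thresholds):
    """Threshold-based gating sketch: a token activates every expert whose
    score exceeds that expert's threshold, so top-k differs per token.
    Hypothetical scoring rule, not DynMoE's exact gating function."""
    scores = 1.0 / (1.0 + np.exp(-(x @ gate_weight.T)))  # sigmoid gate scores, (tokens, experts)
    mask = scores > thresholds                            # boolean activation mask per token
    # Fallback: a token that clears no threshold is routed to its best-scoring expert.
    none_active = ~mask.any(axis=-1)
    if none_active.any():
        best = scores[none_active].argmax(axis=-1)
        mask[np.where(none_active)[0], best] = True
    # Renormalize the retained scores into routing weights.
    weights = np.where(mask, scores, 0.0)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))          # 4 tokens, hidden dim 8
gate_w = rng.normal(size=(6, 8))          # 6 experts
tau = np.full(6, 0.7)                     # shared activation threshold (a learnable per-expert parameter in DynMoE)
weights, mask = adaptive_topk_gate(tokens, gate_w, tau)
per_token_k = mask.sum(axis=-1)           # effective top-k, token by token
```

The key property is visible in `per_token_k`: different tokens end up activating different numbers of experts, which is what removes top-k as a fixed hyper-parameter.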
Problem

Research questions and friction points this paper is trying to address.

SMoE performance depends heavily on hyper-parameters such as expert count and top-k.
Searching over hyper-parameter configurations requires costly repeated model training.
Fixed top-k activation cannot adapt the number of experts to individual tokens.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Mixture of Experts (DynMoE) technique
Novel gating method for expert activation
Adaptive process for expert number adjustment
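The adaptive expert-count idea can be sketched as a decision made from routing statistics gathered over a training interval: prune experts that are rarely activated, and add an expert when many tokens fail to match any. The threshold fractions and the counter-based criteria here are hypothetical illustrations, not DynMoE's actual adjustment rule.

```python
def adjust_expert_count(activation_counts, fallback_count, total_tokens,
                        prune_frac=0.01, add_frac=0.05):
    """Sketch of dynamic expert addition/removal from routing statistics.
    activation_counts: tokens routed to each expert over the interval.
    fallback_count: tokens that cleared no expert's threshold.
    Thresholds (prune_frac, add_frac) are illustrative assumptions."""
    # Prune experts that served almost no tokens in this interval.
    prune = [i for i, c in enumerate(activation_counts)
             if c < prune_frac * total_tokens]
    # Add a new expert if many tokens matched no existing expert.
    add_expert = fallback_count > add_frac * total_tokens
    return prune, add_expert
```

For example, with counts `[500, 2, 300]` over 1000 tokens and 60 fallback tokens, expert 1 falls below the 1% usage floor and is pruned, while the 6% fallback rate triggers adding a new expert.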
Yongxin Guo
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China; The Shenzhen Institute of Artificial Intelligence and Robotics for Society
Zhenglin Cheng
Zhejiang University & Westlake University, SII
Multimodal Learning; Diffusion Models
Xiaoying Tang
School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, Guangdong, 518172, P.R. China; The Shenzhen Institute of Artificial Intelligence and Robotics for Society; The Guangdong Provincial Key Laboratory of Future Networks of Intelligence
Tao Lin
School of Engineering, Westlake University; Research Center for Industries of the Future, Westlake University