🤖 AI Summary
To address strong parameter coupling, slow convergence, and insufficient expert specialization caused by Softmax gating in Hierarchical Mixture of Experts (HMoE) models, this paper proposes a two-level Laplace gating mechanism, introducing Laplace-distributed gating functions at both the top- and bottom-level expert-selection stages of the HMoE. This design theoretically decouples parameter-update paths across experts, thereby accelerating convergence and enhancing expert specialization. Theoretical analysis demonstrates superior gradient-propagation properties compared with Softmax-based gating. Extensive experiments on multimodal understanding, image classification, and latent domain discovery tasks show that the proposed models consistently outperform the Softmax-HMoE baseline, validating the method's effectiveness in modeling complex inputs and adapting to diverse downstream tasks.
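As a rough illustration of the core idea (not the paper's exact formulation), a Laplace gate replaces the softmax's inner-product score with a negative Euclidean distance to a per-expert anchor vector, so each expert's gating weight is tied to its own anchor rather than to inner-product comparisons against all experts. The function names and toy dimensions below are hypothetical:

```python
import numpy as np

def softmax_gate(x, W):
    # Standard softmax gating: scores are inner products x . w_i.
    scores = W @ x
    e = np.exp(scores - scores.max())
    return e / e.sum()

def laplace_gate(x, A):
    # Laplace gating (sketch): scores are negative Euclidean distances
    # to per-expert anchors a_i, the intuition behind the reduced
    # cross-expert parameter coupling described above.
    scores = -np.linalg.norm(A - x, axis=1)
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))  # 3 experts, toy input dimension 4
print("softmax gate:", softmax_gate(x, W))
print("laplace gate:", laplace_gate(x, W))
```

Both gates produce a valid probability distribution over experts; only the score function differs.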
📝 Abstract
With the growing prominence of the Mixture of Experts (MoE) architecture in developing large-scale foundation models, we investigate the Hierarchical Mixture of Experts (HMoE), a specialized variant of MoE that excels in handling complex inputs and improving performance on targeted tasks. Our analysis highlights the advantages of using the Laplace gating function over the traditional Softmax gating within the HMoE framework. We theoretically demonstrate that applying the Laplace gating function at both levels of the HMoE model helps eliminate undesirable parameter interactions caused by the Softmax gating and therefore accelerates expert convergence as well as enhances expert specialization. Empirical validation across diverse scenarios supports these theoretical claims. This includes large-scale multimodal tasks, image classification, and latent domain discovery and prediction tasks, where our modified HMoE models show substantial performance improvements over conventional HMoE models.
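To make the "both levels" structure concrete, here is a minimal sketch of a two-level HMoE forward pass in which a Laplace gate selects among expert groups at the top level and among experts within a group at the bottom level. All names (`hmoe_forward`, the anchor arrays, the linear toy experts) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def laplace_gate(x, A):
    # Laplace gating (sketch): softmax over negative Euclidean
    # distances to per-expert anchor vectors.
    s = -np.linalg.norm(A - x, axis=1)
    e = np.exp(s - s.max())
    return e / e.sum()

def hmoe_forward(x, top_anchors, sub_anchors, experts):
    # Two-level HMoE (sketch): the top-level gate weights K expert
    # groups; within group i, a bottom-level gate weights its M
    # experts. The output is the doubly gated mixture.
    g_top = laplace_gate(x, top_anchors)           # shape (K,)
    out = 0.0
    for i in range(len(top_anchors)):
        g_bot = laplace_gate(x, sub_anchors[i])    # shape (M,)
        for j in range(len(sub_anchors[i])):
            out = out + g_top[i] * g_bot[j] * experts[i][j](x)
    return out

# Toy setup: K=2 groups, M=2 linear experts per group, input dim 3.
rng = np.random.default_rng(1)
d = 3
top = rng.normal(size=(2, d))
sub = rng.normal(size=(2, 2, d))
experts = [[(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
            for _ in range(2)] for _ in range(2)]
x = rng.normal(size=d)
y = hmoe_forward(x, top, sub, experts)
```

Because each level's gate is a proper distribution, the combined weights g_top[i] * g_bot[j] also sum to one over all (i, j) pairs, so the two-level mixture remains a convex combination of expert outputs.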