A Closer Look into Mixture-of-Experts in Large Language Models

📅 2024-06-26
🏛️ North American Chapter of the Association for Computational Linguistics
📈 Citations: 14
Influential: 0
🤖 AI Summary
The internal mechanisms and modular nature of Mixture-of-Experts (MoE) large language models remain poorly understood, particularly regarding expert granularity, routing behavior, and layer-wise expert diversity. Method: The authors conduct attribution analysis, expert activation visualization, output norm statistics, and controlled experiments across three popular open MoE models. Contribution/Results: They empirically establish that individual neurons function as fine-grained experts; that routers exhibit a strong preference for experts with large output norms; and that expert diversity generally increases with network depth, except in the last layer, which behaves as an outlier. Based on these findings, they provide actionable suggestions for router design and expert allocation. The work offers empirical evidence for the modular structure of MoE models and identifies anomalous behavior in the top layer. Code is open-sourced.

📝 Abstract
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three popular MoE-based models and reveal some intriguing observations, including 1) Neurons act like fine-grained experts; 2) The router of MoE usually selects experts with larger output norms; 3) The expert diversity increases as the layer increases, while the last layer is an outlier, which is further validated by an initial experiment. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
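The sparse activation the abstract describes can be sketched as top-k softmax gating over a set of expert networks. This is a minimal illustrative sketch, not the routing code of any of the studied models: the linear router, the expert interfaces, and `moe_forward` are all hypothetical names.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through the top-k experts of a single MoE layer.

    x: (d,) token hidden state; gate_w: (n_experts, d) router weight matrix;
    experts: list of callables mapping (d,) -> (d,).
    """
    logits = gate_w @ x                              # one router score per expert
    topk = np.argsort(logits)[-k:]                   # indices of the k largest scores
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                         # softmax over the selected experts only
    # Sparse activation: only k of the n experts run for this token,
    # so compute scales with k, not with the total parameter count.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))
```

With `k` fixed and the number of experts grown, the parameter count rises while per-token compute stays roughly constant, which is the performance/cost trade-off the abstract refers to.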
Problem

Research questions and friction points this paper is trying to address.

Understanding the inner workings of MoE-based large language models
Characterizing the parametric and behavioral features of MoE models
Investigating how expert diversity evolves across layers and how the router selects experts
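One plausible way to quantify the "expert diversity" question above is average pairwise dissimilarity of expert weights within a layer. The metric below (1 minus mean off-diagonal cosine similarity of flattened weight matrices) is a hypothetical proxy for illustration; the paper's own analyses may use different measures.

```python
import numpy as np

def expert_diversity(expert_weights):
    """Mean pairwise (1 - cosine similarity) over a layer's expert weight matrices.

    expert_weights: list of same-shaped arrays, one per expert.
    Returns 0.0 for identical experts, up to 1.0 for mutually orthogonal ones.
    """
    flat = np.stack([w.ravel() for w in expert_weights]).astype(float)
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)   # unit-normalize each expert
    sim = flat @ flat.T                                   # pairwise cosine similarities
    n = len(expert_weights)
    off_diag = sim[~np.eye(n, dtype=bool)]                # drop self-similarities
    return 1.0 - off_diag.mean()
```

Computed per layer, such a score lets one plot diversity against depth and check the paper's observation that it rises with depth except at the last layer.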
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combined parametric and behavioral analysis of expert weights and activations
Evidence that individual neurons act as fine-grained experts
Finding that the router tends to select experts with larger output norms
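The norm-preference finding above suggests a simple diagnostic: per token, correlate the router's scores with the norms of each expert's output, then average over tokens. This is a hedged sketch of such a probe; the function name and the linear-router assumption are hypothetical, not the paper's procedure.

```python
import numpy as np

def norm_score_correlation(x_batch, gate_w, experts):
    """Mean per-token Pearson correlation between router logits and expert
    output norms. Values near +1 indicate the router favors high-norm experts."""
    corrs = []
    for x in x_batch:
        logits = gate_w @ x                                      # router score per expert
        norms = np.array([np.linalg.norm(f(x)) for f in experts])
        corrs.append(np.corrcoef(logits, norms)[0, 1])           # correlation for this token
    return float(np.mean(corrs))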