🤖 AI Summary
Current research on large language models with Mixture-of-Experts (MoE) architectures lacks interpretability at the expert level, particularly a systematic understanding of expert activation mechanisms and their causal roles. This work proposes distinguishing between “domain experts,” which specialize in specific content areas, and “driver experts,” which exert stronger causal influence over model outputs. We introduce, for the first time, an entropy-based metric to identify domain-preferring experts, and combine it with causal intervention analysis to quantify each expert’s causal effect on predictions. Experiments reveal that driver experts are more readily activated by tokens early in a sentence. By selectively adjusting the weights of these two expert types, we achieve significant performance gains across three publicly available MoE models and multiple domains, enhancing both the interpretability and controllability of MoE systems.
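The summary does not give the paper's exact formula, but the idea of an entropy-based domain-preference metric can be sketched as follows: count how often each expert is routed to within each domain, and score each expert by the Shannon entropy of that distribution. Low entropy means the expert's activations concentrate in few domains (a candidate “domain expert”). The function name `domain_preference_entropy` and the toy counts are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def domain_preference_entropy(activation_counts):
    """Normalized Shannon entropy of each expert's activation distribution
    over domains.

    activation_counts: (num_experts, num_domains) array counting how often
    each expert is routed to within each domain. Values near 0 mean the
    expert fires almost exclusively in a few domains; values near 1 mean
    its activations are spread uniformly across domains.
    """
    counts = np.asarray(activation_counts, dtype=float)
    probs = counts / counts.sum(axis=1, keepdims=True)   # per-expert distribution
    logp = np.where(probs > 0, np.log(probs), 0.0)       # 0 * log 0 := 0
    entropy = -(probs * logp).sum(axis=1)                # in nats
    return entropy / np.log(counts.shape[1])             # normalize to [0, 1]

# Toy example: expert 0 fires almost only in domain 0; expert 1 is uniform.
counts = np.array([[98, 1, 1],
                   [34, 33, 33]])
h = domain_preference_entropy(counts)
# h[0] is close to 0 (strong domain preference), h[1] is close to 1.
```

Ranking experts by this score and thresholding the low-entropy tail is one plausible way to turn the metric into a concrete expert-selection rule.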
📝 Abstract
Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing between domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with, and trigger the activation of, specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics that assess whether an expert is strongly favored by a particular domain and how strongly an expert's activation contributes causally to the model's output, thereby identifying domain and driver experts, respectively. We further explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger driver experts; and (3) adjusting the weights of domain and driver experts yields significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
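The causal-effect metric for identifying driver experts can likewise be sketched, under assumptions not spelled out in the abstract: intervene on one expert at a time (here, by zeroing its gate) and measure how far the layer's output moves. The toy MoE layer below, with random linear experts and a softmax router, is purely illustrative; the paper's actual models and effect measure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: each expert is a linear map; a softmax router produces gates.
num_experts, d = 4, 8
experts = rng.normal(size=(num_experts, d, d))   # expert weight matrices
router = rng.normal(size=(d, num_experts))       # router weights

def moe_forward(x, expert_scale=None):
    """Gated mixture output; expert_scale[e] rescales expert e's gate
    (1.0 = unchanged, 0.0 = ablated)."""
    gates = np.exp(x @ router)
    gates /= gates.sum()
    if expert_scale is not None:
        gates = gates * expert_scale
    return sum(gates[e] * (experts[e] @ x) for e in range(num_experts))

x = rng.normal(size=d)
base = moe_forward(x)

# Causal effect of expert e: how much the output shifts when e is knocked out.
effects = []
for e in range(num_experts):
    scale = np.ones(num_experts)
    scale[e] = 0.0                       # intervention: ablate expert e
    effects.append(np.linalg.norm(base - moe_forward(x, scale)))

driver = int(np.argmax(effects))         # expert with the largest causal effect
```

The same scaffolding also covers the weight-adjustment experiments: instead of setting a gate to zero, `expert_scale` can up- or down-weight identified domain or driver experts and the downstream effect on task performance can be measured.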