Mixture of Experts Made Intrinsically Interpretable

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Polysemantic neurons, which encode several unrelated concepts at once, undermine interpretability in large language models. To address this, we propose MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Specifically, we rewrite the MoE layer as an equivalent sparse, large MLP, enforce sparse activation within each expert, and design a sparsity-aware routing mechanism that prioritizes the experts with the highest activation sparsity. Crucially, MoE-X targets neuron-level interpretability *without* post-hoc processing. On both chess and natural language tasks, MoE-X performs comparably to dense models and achieves a better perplexity than GPT-2. Its interpretability surpasses that of post-hoc approaches, including those based on sparse autoencoders (SAEs).

📝 Abstract

Neurons in large language models often exhibit *polysemanticity*, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present **MoE-X**, a Mixture-of-Experts (MoE) language model designed to be *intrinsically* interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
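
The claim that an MoE layer can be rewritten as an equivalent sparse, large MLP can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes ReLU experts with one encoder and one decoder matrix each, random gate scores in place of a learned router, and no gate renormalization. All names (`moe_forward`, `wide_mlp_forward`, `top_k_gates`) and sizes are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def top_k_gates(scores, k):
    # Keep the k largest scores and zero out the rest (assumption: no renormalization).
    keep = set(sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k])
    return [s if e in keep else 0.0 for e, s in enumerate(scores)]

def moe_forward(x, encoders, decoders, gates):
    # Standard MoE view: run only the active experts and sum their gated outputs.
    y = [0.0] * len(decoders[0])
    for W_enc, W_dec, g in zip(encoders, decoders, gates):
        if g == 0.0:
            continue  # inactive expert, skipped entirely
        out = matvec(W_dec, relu(matvec(W_enc, x)))
        y = [a + g * b for a, b in zip(y, out)]
    return y

def wide_mlp_forward(x, encoders, decoders, gates):
    # Wide-MLP view: stack all encoders into one tall matrix, scale each hidden
    # unit by its expert's gate (zero for inactive experts), then apply the
    # decoders concatenated side by side.
    d_hidden = len(encoders[0])
    W_big = [row for W_enc in encoders for row in W_enc]
    hidden = relu(matvec(W_big, x))
    mask = [g for g in gates for _ in range(d_hidden)]
    hidden = [m * h for m, h in zip(mask, hidden)]
    D_big = [[w for W_dec in decoders for w in W_dec[i]]
             for i in range(len(decoders[0]))]
    return matvec(D_big, hidden), hidden

random.seed(0)
d_in, d_hidden, d_out, n_experts, k = 4, 3, 2, 4, 2
encoders = [rand_matrix(d_hidden, d_in) for _ in range(n_experts)]
decoders = [rand_matrix(d_out, d_hidden) for _ in range(n_experts)]
x = [random.uniform(-1.0, 1.0) for _ in range(d_in)]
gates = top_k_gates([random.random() for _ in range(n_experts)], k)

y_moe = moe_forward(x, encoders, decoders, gates)
y_wide, hidden = wide_mlp_forward(x, encoders, decoders, gates)
print(y_moe)
print(y_wide)
```

The two outputs agree up to floating-point rounding. In the wide-MLP view the hidden layer has `n_experts * d_hidden` units, of which at most `k * d_hidden` can be nonzero, which is the sense in which routing lets the hidden size grow while the layer stays sparse.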

Problem

Research questions and friction points this paper is trying to address.

Address polysemanticity in large language models

Develop intrinsically interpretable Mixture-of-Experts (MoE) model

Enhance interpretability while maintaining model performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

MoE-X: Intrinsically interpretable Mixture-of-Experts model

Sparse activation for scalable, interpretable language models

Redesigned routing mechanism prioritizes the experts with the sparsest activations
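
To make the routing idea concrete, here is a naive sketch of choosing experts by activation sparsity. The text above does not say how sparsity is measured or estimated, so this version is an assumption: it computes every expert's ReLU activations and counts exact zeros, which would defeat the efficiency benefit of routing in a real model. The function names and the toy weights are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def activation_sparsity(W_enc, x):
    # Fraction of hidden units that are exactly zero after ReLU.
    h = relu(matvec(W_enc, x))
    return sum(1 for a in h if a == 0.0) / len(h)

def route_by_sparsity(x, encoders, k):
    # Rank experts from sparsest to densest activation and keep the top k.
    sparsity = [activation_sparsity(W, x) for W in encoders]
    ranked = sorted(range(len(encoders)), key=lambda e: sparsity[e], reverse=True)
    return ranked[:k], sparsity

# Toy input and three hand-written experts (illustrative values only).
x = [1.0, -1.0]
encoders = [
    [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]],   # pre-activations [1, 1, 0]
    [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],   # pre-activations [-1, -1, 1]
    [[1.0, 0.0], [0.0, -1.0], [1.0, -1.0]],  # pre-activations [1, 1, 2]
]
chosen, sparsity = route_by_sparsity(x, encoders, k=2)
print(chosen, sparsity)
```

With these weights, expert 1 has two of three units at zero, expert 0 has one, and expert 2 has none, so the top-2 selection is experts 1 and 0 in that order.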

🔎 Similar Papers

The FIX Benchmark: Extracting Features Interpretable to eXperts