Monet: Mixture of Monosemantic Experts for Transformers

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses polysemanticity, the phenomenon where individual neurons in large language models (LLMs) respond to multiple unrelated concepts, which hinders mechanistic interpretability and safety alignment. The authors propose Monet, an end-to-end interpretable Mixture-of-Experts (MoE) pretraining framework that intrinsically enforces monosemanticity. The method embeds sparse dictionary learning directly into the MoE architecture, inducing semantically exclusive representations within expert modules during pretraining. An expert decomposition mechanism scales the expert count to 262,144 per layer while total parameters grow only with the square root of the number of experts. This design yields inter-expert knowledge exclusivity, precise concept localization, and direct editability. Experiments demonstrate substantial neuron-level semantic disentanglement, accurate cross-lingual and cross-domain concept localization, and fine-grained suppression of toxic content, all without compromising general-purpose model performance. Code and pretrained checkpoints are publicly released.
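The square-root parameter scaling can be illustrated with a minimal sketch. This is an assumption-laden toy, not Monet's actual layer: N = m × m virtual experts are formed by pairing one of m "down" projections with one of m "up" projections, so parameter count grows with m = √N rather than N. The class name `DecomposedExperts` and the independent soft routers are hypothetical simplifications.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class DecomposedExperts:
    """Toy product-style expert decomposition (illustration only).

    N = m*m virtual experts are implicit pairings of m down-projections
    with m up-projections, so parameters scale with m = sqrt(N).
    """
    def __init__(self, d_model, d_expert, n_experts, seed=0):
        m = int(round(n_experts ** 0.5))
        assert m * m == n_experts, "n_experts must be a perfect square"
        rng = np.random.default_rng(seed)
        self.m = m
        self.down = rng.standard_normal((m, d_model, d_expert)) * 0.02
        self.up = rng.standard_normal((m, d_expert, d_model)) * 0.02
        self.w_h = rng.standard_normal((d_model, m)) * 0.02  # routes over down half
        self.w_v = rng.standard_normal((d_model, m)) * 0.02  # routes over up half

    def __call__(self, x):
        # x: (batch, d_model); soft routing over each factor independently,
        # which mixes all m*m virtual experts without materializing them.
        p_h = softmax(x @ self.w_h)  # (batch, m)
        p_v = softmax(x @ self.w_v)  # (batch, m)
        hidden = np.einsum("bd,mde,bm->be", x, self.down, p_h)
        return np.einsum("be,med,bm->bd", np.maximum(hidden, 0), self.up, p_v)

    def n_params(self):
        return self.down.size + self.up.size + self.w_h.size + self.w_v.size
```

For example, growing from 16 to 1,024 virtual experts (64×) only multiplies m, and hence parameters, by 8× = √64, which is the scaling behavior the paper exploits to reach 262,144 experts per layer.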

📝 Abstract
Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
Problem

Research questions and friction points this paper is trying to address.

Address polysemanticity in large language models
Enhance mechanistic interpretability without performance loss
Enable knowledge manipulation across domains and languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monet integrates sparse dictionary learning into Mixture-of-Experts pretraining.
Scales expert count to 262,144 per layer efficiently.
Enables domain, language, and toxicity knowledge manipulation.
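Because each expert is claimed to hold exclusive knowledge, suppressing a concept reduces to removing routing mass from the experts localized to it. The helper below is a hypothetical sketch of that idea, not the API of the released Monet code: it zeroes routing probabilities for a chosen expert set and renormalizes, leaving all other routing untouched.

```python
import numpy as np

def suppress_experts(routing_probs, expert_ids):
    """Zero routing mass on the given experts and renormalize.

    routing_probs: (..., n_experts) array of per-token expert probabilities.
    expert_ids: iterable of expert indices localized to the unwanted concept.
    Hypothetical helper; the released checkpoints may expose a different API.
    """
    probs = routing_probs.copy()
    probs[..., list(expert_ids)] = 0.0  # resect the targeted experts
    total = probs.sum(axis=-1, keepdims=True)
    return probs / np.clip(total, 1e-12, None)  # keep a valid distribution
```

Under the paper's mutual-exclusivity claim, masking a small expert set edits one domain, language, or toxic behavior while the renormalized routing preserves general performance.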