SAFR: Neuron Redistribution for Interpretability

📅 2025-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic ambiguity and poor interpretability in Transformer neurons caused by feature superposition, this paper proposes a semantics-decoupling neuron regularization method. Specifically, it identifies salient tokens and their semantically related pairs using VMASK and attention weights, then differentially modulates the semantic granularity of MLP-layer neurons based on token importance and semantic relevance. A hybrid L1/L2 regularization term is introduced to encourage unambiguous (single-meaning) neuron activation for important tokens and shared ambiguous (multi-meaning) neuron activation for semantically related token pairs. This approach achieves, for the first time, decoupled optimization of interpretability and model performance: on two classification tasks, it preserves original accuracy while improving neuron monosemanticity by 37%. Moreover, it enables fine-grained neuron allocation visualization and interactive interpretation.

📝 Abstract
Superposition refers to encoding representations of multiple features within a single neuron, which is common in transformers. This property allows neurons to combine and represent multiple features, enabling the model to capture intricate information and handle complex tasks. Despite its benefits to performance, it diminishes the model's interpretability. This paper presents a novel approach to enhancing transformer interpretability by regularizing feature superposition. We introduce SAFR, which applies regularization terms to the loss function to promote monosemantic representations for important tokens while encouraging polysemanticity for correlated token pairs, where important tokens and correlated token pairs are identified via VMASK and attention weights. With a transformer model on two classification tasks, SAFR improves interpretability without compromising prediction performance. Given an input to the model, SAFR provides an explanation by visualizing the neuron allocation and interaction within the MLP layers.
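The abstract describes a loss-level regularizer: an L1-style penalty pushing important tokens toward sparse (near-monosemantic) neuron activations, and a penalty encouraging correlated token pairs to share (polysemantic) neurons. The paper text here does not give the exact formulation, so the sketch below is only an illustrative guess at such a hybrid term; the function name, the use of an activation-difference L2 penalty for pair sharing, and the weighting hyperparameters are all assumptions.

```python
import numpy as np

def safr_style_penalty(acts, important_idx, correlated_pairs,
                       lambda_mono=0.1, lambda_poly=0.1):
    """Illustrative sketch of a hybrid L1/L2 superposition regularizer.

    acts:             (seq_len, n_neurons) MLP activations for one input
    important_idx:    indices of salient tokens (e.g. from VMASK scores)
    correlated_pairs: (i, j) token index pairs (e.g. from attention weights)

    NOTE: this is not the paper's exact loss, only a plausible instance
    of "L1 for monosemanticity, L2 for shared polysemanticity".
    """
    acts = np.asarray(acts)

    # L1 on salient tokens' activations: drives each important token to
    # load on few neurons, i.e. toward monosemantic representations.
    mono = np.abs(acts[important_idx]).sum()

    # L2 on the activation-pattern difference of each correlated pair:
    # shrinking the difference pushes the pair to reuse the same neurons.
    poly = sum(((acts[i] - acts[j]) ** 2).sum()
               for i, j in correlated_pairs)

    return lambda_mono * mono + lambda_poly * poly
```

In training, this scalar would simply be added to the task loss (e.g. cross-entropy), so interpretability pressure and prediction accuracy are optimized jointly rather than traded off post hoc.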
Problem

Research questions and friction points this paper is trying to address.

Neural Network Models
Transformer Models
Interpretability Issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAFR
Neural Rearrangement
Model Interpretability
Ruidi Chang
Rice University
Natural Language Processing · Machine Learning Interpretability
Chunyuan Deng
Department of Computer Science, Rice University
Hanjie Chen
Department of Computer Science, Rice University