NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the polysemanticity of neurons in large language models, the phenomenon in which an individual neuron responds to multiple distinct semantic concepts, which challenges conventional single-pass interpretation methods. To address this, the paper introduces NeuronScope, presented as the first framework to formulate neuron interpretation as a collaborative, iterative multi-agent process. NeuronScope decomposes neuron activations into atomic semantic components through activation-guided disentanglement, clusters these components to identify distinct semantic patterns, and iteratively refines interpretations via activation feedback. By explicitly disentangling the internal polysemantic structure of neurons, the approach significantly outperforms existing single-pass baselines on activation correlation and uncovers semantic complexity hidden within individual neurons.

📝 Abstract
Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly deconstructs neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation compared to single-pass baselines.
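The decompose, cluster, and refine pipeline described in the abstract can be sketched roughly as follows. Everything here is an illustrative assumption: the function names, the agent interfaces, the rank-based clustering heuristic, and the stopping rule are stand-ins, not the paper's actual implementation, and a real system would use LLM agents and embedding-based clustering.

```python
# Hypothetical sketch of the decompose -> cluster -> refine loop described
# in the abstract. Agent interfaces, the clustering heuristic, and the
# stopping rule are assumptions for illustration, not the paper's method.
import numpy as np

def activation_correlation(predicted, observed):
    """Pearson correlation between predicted and observed activations,
    the score the abstract says explanations are evaluated on."""
    return float(np.corrcoef(predicted, observed)[0, 1])

def neuronscope_sketch(exemplars, activations, explain_agent, score_agent,
                       n_modes=2, max_rounds=3, tol=1e-3):
    """exemplars: texts that strongly activate the neuron
    activations: observed activation per exemplar
    explain_agent(texts, feedback) -> candidate explanation (str)
    score_agent(explanation, text) -> predicted activation (float)"""
    activations = np.asarray(activations, dtype=float)
    # 1. Disentangle: split exemplars into putative semantic modes
    #    (toy heuristic: interleave by activation rank; a real system
    #    would cluster semantic embeddings of the exemplars).
    order = np.argsort(-activations)
    modes = [[exemplars[i] for i in order[k::n_modes]] for k in range(n_modes)]
    results = []
    for mode in modes:
        feedback, best = None, ("", float("-inf"))
        # 2-3. Explain each mode, then refine using activation feedback.
        for _ in range(max_rounds):
            explanation = explain_agent(mode, feedback)
            predicted = np.array([score_agent(explanation, t)
                                  for t in exemplars])
            r = activation_correlation(predicted, activations)
            if r > best[1] + tol:
                best = (explanation, r)
                feedback = f"correlation={r:.3f}; revise to raise it"
            else:
                break  # no improvement: stop refining this mode
        results.append(best)
    return results  # one (explanation, correlation) pair per semantic mode
```

With stub agents in place of LLM calls, the loop returns one scored explanation per semantic mode, which mirrors the abstract's claim that each mode receives its own iteratively refined explanation.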
Problem

Research questions and friction points this paper is trying to address.

polysemanticity
neuron interpretation
large language models
multi-concept behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

polysemantic neurons
multi-agent framework
neuron interpretation
activation-guided
iterative refinement
Weiqi Liu, Wuhan University
Yongliang Miao, Hong Kong Baptist University
Haiyan Zhao, Peking University
Yanguang Liu, New Jersey Institute of Technology
Mengnan Du, Assistant Professor, New Jersey Institute of Technology
Explainability · Natural Language Processing · Trustworthy AI