🤖 AI Summary
This work addresses the polysemanticity of neurons in large language models, where individual neurons respond to multiple distinct semantic concepts, a phenomenon that challenges conventional single-pass interpretation methods. To address this, the paper introduces NeuronScope, the first framework to formulate neuron interpretation as a collaborative, iterative multi-agent process. NeuronScope decomposes neuron activations into atomic semantic components through activation-guided disentanglement, clusters these components to identify distinct semantic patterns, and iteratively refines interpretations via activation feedback. This approach explicitly disentangles the internal polysemous structure of neurons, significantly outperforms existing single-pass baselines on activation correlation, and effectively uncovers the hidden semantic complexity within individual neurons.
📝 Abstract
Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly decomposes neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation than single-pass baselines.
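The iterative, activation-guided refinement described in the abstract can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's actual implementation: the real framework uses LLM agents to propose and score explanations, whereas this sketch stands in a keyword-matching "scorer" and a greedy refinement loop driven by correlation with the neuron's true activations.

```python
# Toy sketch of activation-feedback refinement (illustrative only;
# NeuronScope's real agents and scoring are LLM-based, not keyword-based).

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def predicted_activation(explanation, text):
    # Stand-in for explanation-based activation simulation:
    # predict firing iff any keyword of the explanation appears.
    return 1.0 if any(k in text for k in explanation) else 0.0

def refine(texts, activations, vocabulary, rounds=3):
    # Greedy refinement: each round, keep the candidate keyword that
    # most improves correlation with the observed activations.
    explanation, best = set(), -1.0
    for _ in range(rounds):
        gains = []
        for k in vocabulary - explanation:
            trial = explanation | {k}
            preds = [predicted_activation(trial, t) for t in texts]
            gains.append((pearson(preds, activations), k))
        score, k = max(gains)
        if score <= best:          # activation feedback stopped improving
            break
        best, explanation = score, explanation | {k}
    return explanation, best

# Hypothetical probe texts and neuron activations (finance sense of "bank").
texts = ["river bank", "bank loan", "credit bank", "river flow", "green tree"]
acts = [1.0, 1.0, 1.0, 0.0, 0.0]
vocab = {"river", "loan", "credit", "tree", "bank"}
exp, score = refine(texts, acts, vocab)
```

On this toy data the loop selects the single keyword that perfectly predicts the activations and then stops, since no addition improves the correlation; clustering components before refining (as the abstract describes) would run one such loop per semantic mode.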