🤖 AI Summary
This work addresses the polysemanticity of neurons in large language models, where individual neurons respond to multiple distinct semantic concepts, a phenomenon that challenges conventional single-pass interpretation methods. To address this, the paper introduces NeuronScope, the first framework to formulate neuron interpretation as a collaborative, iterative multi-agent process. NeuronScope decomposes neuron activations into atomic semantic components through activation-guided disentanglement, clusters these components to identify distinct semantic patterns, and iteratively refines interpretations via activation feedback. This approach explicitly disentangles the internal polysemous structure of neurons, significantly outperforms existing single-pass baselines on activation correlation, and effectively uncovers the hidden semantic complexity within individual neurons.
📝 Abstract
Neuron-level interpretation in large language models (LLMs) is fundamentally challenged by widespread polysemanticity, where individual neurons respond to multiple distinct semantic concepts. Existing single-pass interpretation methods struggle to faithfully capture such multi-concept behavior. In this work, we propose NeuronScope, a multi-agent framework that reformulates neuron interpretation as an iterative, activation-guided process. NeuronScope explicitly decomposes neuron activations into atomic semantic components, clusters them into distinct semantic modes, and iteratively refines each explanation using neuron activation feedback. Experiments demonstrate that NeuronScope uncovers hidden polysemanticity and produces explanations with significantly higher activation correlation than single-pass baselines.
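The iterative, activation-guided refinement described in the abstract can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's actual implementation: the real framework uses LLM agents to propose and score explanations, whereas this sketch stands in a keyword-matching "scorer" and a greedy refinement loop driven by correlation with the neuron's true activations.

```python
# Toy sketch of activation-feedback refinement (illustrative only;
# NeuronScope's real agents and scoring are LLM-based, not keyword-based).

def pearson(xs, ys):
    # Plain Pearson correlation between two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def predicted_activation(explanation, text):
    # Stand-in for explanation-based activation simulation:
    # predict firing iff any keyword of the explanation appears.
    return 1.0 if any(k in text for k in explanation) else 0.0

def refine(texts, activations, vocabulary, rounds=3):
    # Greedy refinement: each round, keep the candidate keyword that
    # most improves correlation with the observed activations.
    explanation, best = set(), -1.0
    for _ in range(rounds):
        gains = []
        for k in vocabulary - explanation:
            trial = explanation | {k}
            preds = [predicted_activation(trial, t) for t in texts]
            gains.append((pearson(preds, activations), k))
        score, k = max(gains)
        if score <= best:          # activation feedback stopped improving
            break
        best, explanation = score, explanation | {k}
    return explanation, best

# Hypothetical probe texts and neuron activations (finance sense of "bank").
texts = ["river bank", "bank loan", "credit bank", "river flow", "green tree"]
acts = [1.0, 1.0, 1.0, 0.0, 0.0]
vocab = {"river", "loan", "credit", "tree", "bank"}
exp, score = refine(texts, acts, vocab)
```

On this toy data the loop selects the single keyword that perfectly predicts the activations and then stops, since no addition improves the correlation; clustering components before refining (as the abstract describes) would run one such loop per semantic mode.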