🤖 AI Summary
This paper addresses the interpretability challenge of large language models (LLMs) by proposing the "concept neuron" paradigm: a framework for efficiently identifying neurons that encode specific semantic concepts. Unlike conventional gradient- or perturbation-based attribution methods, whose cost scales as O(nm) forward passes for n neurons and m examples, the proposed concept vector–driven attribution framework requires only O(n) forward passes for neuron screening, drastically reducing computational overhead. The method integrates concept vector construction, activation clustering, and ablation-based validation to enable precise identification of, and targeted intervention on, neurons representing harmful concepts (e.g., bias or hate speech). Evaluated across multiple benchmarks, it outperforms most baselines, is more computationally efficient than the state-of-the-art method, and demonstrates practical utility in assessing bias in the Indian context.
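The complexity reduction can be illustrated with a minimal sketch. The paper's exact construction is not given here; this assumes a common choice, taking the concept vector as the mean activation difference between concept-positive and concept-negative examples, then ranking all n neurons in a single pass instead of perturbing each neuron on each of m examples:

```python
import numpy as np

def concept_vector(pos_acts, neg_acts):
    """Concept vector as the mean activation difference between
    concept-positive and concept-negative examples (one common
    construction; the paper's may differ)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def score_neurons(v, top_k=10):
    """Rank neurons by the magnitude of their component in the
    concept vector: a single sweep over n neurons, with no
    per-example perturbation, hence O(n) rather than O(nm)."""
    order = np.argsort(-np.abs(v))
    return order[:top_k]

# Toy example: 4 hidden units, where unit 2 separates the two groups.
rng = np.random.default_rng(0)
pos = rng.normal(0.0, 0.1, size=(32, 4))
pos[:, 2] += 3.0                      # concept shifts only unit 2
neg = rng.normal(0.0, 0.1, size=(32, 4))
v = concept_vector(pos, neg)
print(score_neurons(v, top_k=1))      # index of the strongest neuron
```

Here the top-ranked unit is the one whose activation distinguishes concept-positive from concept-negative inputs, which is what the screening step is meant to surface.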
📝 Abstract
Locating the neurons responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have sought mechanisms that operate at the neuron level, but these methods fail to represent a concept, and the compute they require leaves room for further optimization. In this paper, we propose a method that uses concept vectors to locate the significant neurons responsible for representing certain concepts; we term these neurons concept neurons. If n is the number of neurons and m the number of examples, we reduce the number of forward passes required from O(nm) in previous work to just O(n), saving both time and computation. We compare our method with several baselines and previous methods: our results show better performance than most of them and greater efficiency than the state-of-the-art method. As part of our ablation studies, we also optimize the search for concept neurons using clustering methods. Finally, we apply our method to find and turn off the identified neurons and analyze the implications for hate speech and bias in LLMs, evaluating the bias component in the Indian context. Our methodology, analysis, and explanations facilitate understanding of neuron-level responsibility for broader, human-like concepts and lay a path for future research on finding concept neurons and intervening on them.
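The turn-off intervention described above can be sketched as zeroing selected hidden units during a forward pass. This is an illustrative toy network, not the paper's implementation; the ablated output is compared against the baseline to measure how much the chosen neurons contributed:

```python
import numpy as np

def forward(x, W1, W2, ablate=()):
    """Tiny one-hidden-layer network; `ablate` lists hidden-unit
    indices to zero out, mimicking a turn-off intervention."""
    h = np.maximum(x @ W1, 0.0)    # ReLU hidden activations
    h[..., list(ablate)] = 0.0     # silence the chosen concept neurons
    return h @ W2

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))
x = rng.normal(size=(3,))

baseline = forward(x, W1, W2)
ablated = forward(x, W1, W2, ablate=[0, 3])
# The size of this shift is the contribution of units 0 and 3.
print(np.abs(baseline - ablated))
```

In a real LLM the same effect would be achieved by hooking the relevant layer and zeroing the identified concept-neuron activations, then checking whether outputs expressing the harmful concept (e.g., hate speech or bias) are suppressed.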