🤖 AI Summary
This study addresses safety risks arising from poor neuron-level interpretability in large language models (LLMs), focusing on how semantic concepts are encoded within the residual stream. We propose a scalable, minute-scale semantic mapping method that leverages the LM-head’s projection layer to automatically annotate *all* up-projection neurons—achieving full-neuron labeling for Llama 3.1 8B in under 15 minutes. We find that over 75% of neurons exhibit highly consistent dominant token preferences between pre-trained and instruction-tuned variants, revealing substantial semantic stability across fine-tuning. Furthermore, we introduce neuron clamping—a causal intervention strategy—that enables targeted content control (e.g., modulating “dog”-associated neurons to steer generation). Our contributions are threefold: (1) an efficient, scalable framework for whole-neuron semantic decoding; (2) empirical evidence of neuron-level semantic robustness under instruction tuning; and (3) a direct, causally grounded pathway for neuron-level model steering, establishing a novel paradigm for trustworthy LLM deployment.
📝 Abstract
Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. Such black-box models can pose a significant risk to safety when trusted to make important decisions. The lack of interpretability of LLMs stems more from their sheer size than from the complexity of their individual components. The TARS method for knowledge removal (Davies et al., 2024) provides strong evidence for the hypothesis that linear-layer weights which act directly on the residual stream may correlate strongly with different concepts encoded in the residual stream. Building upon this, we attempt to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). Firstly, we show that with Llama 3.1 8B we can utilise the LM-head to decode specialised feature neurons that respond strongly to certain concepts, with examples such as "dog" and "California". We then confirm this causally by demonstrating that clamping these neurons changes the probability of the concept appearing in the output. This extends to the fine-tuned assistant Llama 3.1 8B Instruct model, where we find that over 75% of neurons in the up-projection layers have the same top associated token as in the pretrained model. Finally, we demonstrate that clamping the "dog" neuron leads the instruct model to always discuss dogs when asked about its favourite animal. Through our method, it is possible to map the entirety of Llama 3.1 8B's up-projection neurons in less than 15 minutes with no parallelization.
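The two operations described above, decoding a neuron's weights through the LM-head and clamping a neuron's activation, can be sketched in a few lines. The toy NumPy illustration below is a sketch under stated assumptions, not the authors' implementation: real use would take `up_proj` rows from `model.model.layers[i].mlp.up_proj.weight` and the unembedding matrix from `model.lm_head.weight` of a Hugging Face Llama checkpoint, clamping would be applied via a forward hook during generation, and Llama's actual MLP is SiLU-gated rather than the plain ReLU block used here; all sizes and names are illustrative.

```python
import numpy as np

def decode_neuron(w_neuron, lm_head, top_k=5):
    """Project one up-projection row (d_model,) through the LM-head
    (vocab, d_model); return the top_k token ids and probabilities."""
    logits = lm_head @ w_neuron
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:top_k]
    return top, probs[top]

def mlp_forward(x, up_proj, down_proj, clamp=None):
    """Toy (ungated) MLP block; clamp=(idx, value) pins one neuron's
    activation, mimicking the clamping intervention."""
    h = np.maximum(up_proj @ x, 0.0)        # (d_ff,) neuron activations
    if clamp is not None:
        idx, value = clamp
        h[idx] = value
    return down_proj @ h                    # write back to residual stream

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 64, 256, 1000        # shrunk toy dimensions
up_proj = rng.standard_normal((d_ff, d_model))
down_proj = rng.standard_normal((d_model, d_ff))
lm_head = rng.standard_normal((vocab, d_model))

# 1) Decode neuron 0's most-preferred tokens via the LM-head.
ids, probs = decode_neuron(up_proj[0], lm_head)
print(ids, probs)

# 2) Clamp neuron 0 high; the output shift is a scalar multiple of
#    down_proj's column 0, i.e. the neuron's write direction.
x = rng.standard_normal(d_model)
base = mlp_forward(x, up_proj, down_proj)
clamped = mlp_forward(x, up_proj, down_proj, clamp=(0, 50.0))
print(np.linalg.norm(clamped - base))
```

Because only one activation is changed, the clamped output differs from the baseline exactly along one `down_proj` column, which is what makes the intervention a targeted steer rather than a diffuse perturbation.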