🤖 AI Summary
This work challenges the prevailing assumption that high-level, interpretable concepts in language models must be extracted via sparse autoencoders, demonstrating instead that MLP neurons themselves naturally form a sparse and semantically interpretable basis without additional training. The authors provide the first empirical evidence that individual MLP neurons exhibit inherent sparsity and interpretability, enabling their direct use as fundamental units in circuit analysis. Building on this insight, they propose an end-to-end neuron-level circuit tracing pipeline that combines gradient-based attribution with sparsity analysis to precisely identify causally influential neurons in multi-hop reasoning tasks. Experiments show that model behavior on subject-verb agreement tasks is dominated by only about 100 MLP neurons, and that in "city → state → capital" reasoning chains, specific neuron subsets encoding each inference step can be accurately identified and manipulated to achieve targeted control over model outputs.
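To make the sparsity claim concrete, here is a minimal sketch of one common way to quantify how sparse a neuron basis is: the fraction of MLP activations whose magnitude is (near) zero for a batch of tokens. The function name `activation_sparsity` and the threshold `eps` are illustrative choices, not the paper's exact metric.

```python
import numpy as np

def activation_sparsity(acts: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of activation entries whose magnitude is at most `eps`.

    acts: [tokens, neurons] matrix of MLP activations.
    A value near 1.0 means only a few neurons fire per token,
    i.e. the neuron basis behaves like a sparse code.
    """
    return float(np.mean(np.abs(acts) <= eps))

# Toy example: 4 tokens x 8 neurons, with only 2 nonzero activations.
acts = np.zeros((4, 8))
acts[0, 1] = 2.0
acts[2, 5] = -1.5
print(activation_sparsity(acts))  # 30 of 32 entries are ~0 -> 0.9375
```

Comparing this number between raw MLP activations and SAE feature activations is one simple way to test whether the neuron basis is "as sparse" as the learned SAE basis.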
📄 Abstract
The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of ≈100 MLP neurons is enough to control model behaviour. On the multi-hop city → state → capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. 'map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.
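The gradient-based attribution step described above can be sketched as follows: score each neuron by the product of its activation and the gradient of the target logit with respect to that activation, then keep the top-k neurons as circuit candidates. This is a generic gradient × activation sketch, not the authors' exact pipeline; the function name `top_neurons` and the toy values are illustrative.

```python
import numpy as np

def top_neurons(acts: np.ndarray, grads: np.ndarray, k: int = 3) -> list:
    """Rank neurons by |gradient x activation| attribution.

    acts, grads: [neurons] arrays — one layer's MLP activations and the
    gradient of the target logit w.r.t. those activations (in practice
    obtained from autograd on a single forward/backward pass).
    Returns indices of the k most influential neurons.
    """
    scores = np.abs(acts * grads)
    # Sort descending by attribution score and keep the top k.
    return np.argsort(scores)[::-1][:k].tolist()

# Toy example: neurons 1 and 3 carry almost all of the attribution.
acts  = np.array([0.0, 3.0, 0.5, -2.0, 0.0])
grads = np.array([5.0, 1.0, 0.1, -2.0, 2.0])
print(top_neurons(acts, grads, k=2))  # -> [3, 1]
```

Neurons with large scores are then candidates for causal verification, e.g. by ablating or steering them and checking whether the model's output changes as the abstract describes.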