🤖 AI Summary
This study investigates whether large language models (LLMs) exhibit syntax-category-specific neural representations analogous to those observed in the human brain. Method: Building on Llama 3, we combine neuron importance analysis and part-of-speech (POS)-conditioned activation profiling with linear POS classifiers to systematically identify neuron subsets highly sensitive to grammatical categories (e.g., nouns, verbs). Contribution/Results: We discover, for the first time in LLMs, an interpretable, structured syntactic subspace: distinct neuron groups show stable, linearly separable activation patterns across POS classes. A linear classifier trained on these activations achieves 92.4% accuracy on held-out data, significantly surpassing baseline models. These findings provide neuroscientifically grounded, mechanistic evidence for syntax-aware internal representations in LLMs and reveal functional specialization within the model’s architecture reminiscent of cortical grammar processing in humans.
📝 Abstract
Artificial Neural Networks, the building blocks of AI, were inspired by the human brain's network of neurons. Over the years, these networks have evolved to replicate the complex capabilities of the brain, allowing them to handle tasks such as image and language processing. In the realm of Large Language Models, there has been keen interest in making the language learning process more akin to that of humans. Neuroscientific research has shown that different grammatical categories are processed by different neurons in the brain, and we show that LLMs operate in a similar way. Utilizing Llama 3, we identify the neurons most important for predicting words that belong to different part-of-speech tags. Using these neurons, we train a classifier and show that their activation patterns reliably predict part-of-speech tags on unseen data. The results suggest the presence of a subspace in LLMs focused on capturing part-of-speech tag concepts, resembling patterns observed in neuroscientific lesion studies of the brain.
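The pipeline described above (select tag-sensitive neurons, then train a linear classifier on their activations) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are synthetic random data rather than Llama 3 hidden states, the importance score (variance of per-tag mean activations) is a stand-in for whatever importance measure the paper uses, and the names `select_neurons` and `train_softmax` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-token activations (NOT real Llama 3 data).
n_tokens, n_neurons, n_tags = 1200, 64, 3
tags = rng.integers(0, n_tags, size=n_tokens)
acts = rng.normal(size=(n_tokens, n_neurons))
# Assumption for the demo: two neurons per tag respond to that tag.
for t in range(n_tags):
    acts[tags == t, 2 * t:2 * t + 2] += 2.0

train, test = slice(0, 900), slice(900, None)


def select_neurons(x, y, n_classes, k):
    """Rank neurons by how much their mean activation varies across tags.

    Hypothetical importance score used only for this sketch.
    """
    means = np.stack([x[y == t].mean(axis=0) for t in range(n_classes)])
    score = means.var(axis=0)
    return np.argsort(score)[::-1][:k]


def train_softmax(x, y, n_classes, lr=0.5, steps=300):
    """Fit a linear (softmax regression) probe with plain gradient descent."""
    w = np.zeros((x.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = x @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(x)
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


# Select key neurons on the training split only, then probe them.
key = select_neurons(acts[train], tags[train], n_tags, k=6)
w, b = train_softmax(acts[train][:, key], tags[train], n_tags)

# Evaluate on held-out tokens.
pred = (acts[test][:, key] @ w + b).argmax(axis=1)
accuracy = float((pred == tags[test]).mean())
print(f"selected neurons: {sorted(key.tolist())}, held-out accuracy: {accuracy:.3f}")
```

With real model activations, `acts` would instead hold hidden states collected at a chosen layer and `tags` would come from a POS-annotated corpus. Selecting neurons on the training split only keeps the held-out evaluation free of selection leakage.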