🤖 AI Summary
This study investigates whether language models process filler-gap dependencies and negative polarity item (NPI) licensing, two distinct syntactic phenomena, through shared neural mechanisms. Using two causal interpretability methods, activation patching and distributed alignment search, the work shows at the level of individual components that filler-gap dependencies are supported by a localized set of attention heads and MLP blocks in the early to middle layers that is shared across constructions, whereas NPI licensing shows no such unified mechanism. The analysis further finds that distributed alignment search, a supervised method, is prone to overfitting on narrow linguistic distributions, while the mechanisms identified by activation patching generalize to out-of-distribution data. Finally, manipulating the identified components improves model performance on acceptability judgment benchmarks.
📝 Abstract
While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we use activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize to out-of-distribution data, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that manipulating the identified components improves model performance on acceptability judgment benchmarks.
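For readers unfamiliar with activation patching, the idea can be illustrated with a minimal sketch. This is not the authors' code or setup: the toy "model", the component names (`head_0`, `head_1`, `mlp_0`), the weights, and the inputs are all hypothetical. Each component's contribution is also simply summed, whereas in a real transformer a patched activation would propagate through later layers. The sketch runs a clean and a corrupted input, then re-runs the corrupted input with one component's activation replaced by its clean value and reports how much of the clean-versus-corrupted output gap that single patch restores.

```python
# Toy additive "model": every component reads the input and writes a scalar
# contribution to the output. Names and weights are hypothetical.
COMPONENTS = {
    "head_0": lambda x: 2.0 * x[0],
    "head_1": lambda x: 0.1 * x[1],
    "mlp_0": lambda x: 0.5 * (x[0] + x[1]),
}


def run(x, patch=None):
    """Return (output, cache); `patch` maps a component name to a forced activation."""
    patch = patch or {}
    cache = {}
    for name, fn in COMPONENTS.items():
        # Use the patched activation when one is supplied, else compute it.
        cache[name] = patch[name] if name in patch else fn(x)
    return sum(cache.values()), cache


def patching_effects(clean, corrupted):
    """Fraction of the clean-vs-corrupted output gap restored by each single patch."""
    clean_out, clean_cache = run(clean)
    corrupt_out, _ = run(corrupted)
    gap = clean_out - corrupt_out
    if gap == 0:
        raise ValueError("clean and corrupted inputs give the same output")
    effects = {}
    for name in COMPONENTS:
        # Corrupted run with exactly one activation taken from the clean run.
        patched_out, _ = run(corrupted, patch={name: clean_cache[name]})
        effects[name] = (patched_out - corrupt_out) / gap
    return effects


if __name__ == "__main__":
    # The two inputs differ only in their first feature.
    for name, effect in patching_effects((1.0, 1.0), (0.0, 1.0)).items():
        print(f"{name}: {effect:.2f}")
```

In this toy setting, components that depend on the feature distinguishing the two inputs receive a nonzero score, and a component that ignores it scores zero. Looking for components that score highly across several constructions is the kind of evidence a "shared mechanism" claim rests on.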