🤖 AI Summary
This study investigates how Transformer language models handle syntactic “coordination islands” in English—specifically, the gradient constraints on extraction from coordinated verb phrases. Employing interpretability methods including causal interventions, subspace projection, and large-scale corpus analysis, alongside human acceptability judgments, the work identifies functional subspaces within attention and MLP modules that support filler–gap dependencies. It presents the first application of causal interpretability techniques to the analysis of syntactic island phenomena, demonstrating that models reproduce human-like gradient sensitivity to extraction difficulty in coordination structures. The findings reveal that these dependencies share mechanisms with canonical wh-dependencies but are differentially obstructed. Furthermore, the study proposes a novel linguistic hypothesis: the coordinator “and” is represented distinctly in extractable versus non-extractable configurations.
📝 Abstract
We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., the degraded "I know what he hates art and loves" vs. the more acceptable "I know what he looked down and saw"). We find that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction "and" is represented differently in extractable versus non-extractable constructions, a contrast that corresponds to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.
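The two operations the abstract relies on, intervening on a subspace and projecting activations onto it, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project` and `interchange`, the toy three-dimensional activations, and the assumption that the subspace is given as an orthonormal basis are all hypothetical choices made for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(h: Sequence[float], basis: Sequence[Sequence[float]]) -> Vector:
    """Coordinates of activation h in the subspace spanned by `basis`.

    Assumption: `basis` is orthonormal (hypothetical setup, not from the paper).
    """
    return [dot(h, u) for u in basis]


def interchange(
    base: Sequence[float],
    source: Sequence[float],
    basis: Sequence[Sequence[float]],
) -> Vector:
    """Subspace interchange intervention (illustrative sketch).

    Replaces the component of `base` inside the subspace with the
    corresponding component of `source`, leaving the orthogonal
    complement of `base` untouched.
    """
    patched = [float(x) for x in base]
    for u in basis:
        # Difference between source and base along this basis direction.
        delta = dot(source, u) - dot(base, u)
        patched = [p + delta * ui for p, ui in zip(patched, u)]
    return patched


# Toy usage: a one-dimensional subspace along the first axis.
basis = [[1.0, 0.0, 0.0]]
base_act = [1.0, 2.0, 3.0]    # stands in for an activation from one sentence
source_act = [7.0, 8.0, 9.0]  # stands in for an activation from a contrasting sentence

patched_act = interchange(base_act, source_act, basis)
subspace_coords = project(patched_act, basis)
```

In an actual experiment the patched activation would be written back into the model and the change in its next-token predictions measured. Projecting corpus activations with something like `project` is the kind of step that would let occurrences of "and" be compared within the identified subspace.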