🤖 AI Summary
This study investigates how large language models (LLMs) differentially process syntactic complexity—such as coordination versus subordination, center-embedding, and ambiguity resolution—through their internal representations. It introduces intrinsic dimensionality (ID) as a novel metric to quantify linguistic complexity and combines representational similarity analysis, layer ablation, and cross-model comparisons to track ID dynamics across model layers. The findings demonstrate that ID effectively distinguishes between formal and functional aspects of syntactic complexity, with multi-clause constructions eliciting significantly higher ID values. Moreover, distinct types of syntactic complexity activate specific stages of abstract processing, a pattern consistently observed across multiple state-of-the-art LLMs.
📝 Abstract
We explore the intrinsic dimension (ID) of LLM representations as a marker of linguistic complexity, asking whether different ID profiles across LLM layers differentially characterize formal and functional complexity. We find the formal contrast between sentences with multiple coordinated or subordinated clauses to be reflected in ID differences whose onset aligns with a phase of more abstract linguistic processing independently identified in earlier work. The functional contrasts between sentences characterized by right branching vs. center embedding, or by unambiguous vs. ambiguous relative clause attachment, are also picked up by ID, but in a less marked way, and they do not correlate with the same processing phase. Further experiments using representational similarity and layer ablation confirm the same trends. We conclude that ID is a useful marker of linguistic complexity in LLMs, that it allows us to differentiate between different types of complexity, and that it points to similar stages of linguistic processing across disparate LLMs.
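The abstract does not specify which ID estimator is used. As a hedged illustration only, a common choice in work on the intrinsic dimension of neural representations is the TwoNN estimator, which infers dimensionality from the ratio of each point's distances to its two nearest neighbors. The sketch below (function name `twonn_id` is our own) shows how such an estimate could be computed over a matrix of per-sentence layer activations:

```python
import numpy as np

def twonn_id(X: np.ndarray) -> float:
    """Estimate intrinsic dimension of points X (n_samples x n_features)
    with the TwoNN maximum-likelihood estimator: for each point, take the
    ratio mu = r2/r1 of its second- to first-nearest-neighbor distances;
    the MLE of the dimension is N / sum(log mu)."""
    # Full pairwise Euclidean distance matrix (fine for small n).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # exclude self-distances

    # Distances to the first and second nearest neighbors.
    sorted_d = np.sort(D, axis=1)
    r1, r2 = sorted_d[:, 0], sorted_d[:, 1]

    mu = r2 / r1
    return len(mu) / np.sum(np.log(mu))

# Illustrative sanity check: uniform 2-D data embedded in 10-D
# should yield an ID estimate near 2, regardless of ambient dimension.
rng = np.random.default_rng(0)
planar = np.hstack([rng.random((400, 2)), np.zeros((400, 8))])
print(round(twonn_id(planar), 2))
```

In a layer-wise analysis like the one the abstract describes, `X` would hold one representation per sentence from a given layer, and the estimate would be recomputed per layer to obtain an ID profile across depth; the actual estimator, sampling, and preprocessing in the paper may differ.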