Fundamental Principles of Linguistic Structure are Not Represented by o3

📅 2025-02-15
🤖 AI Summary
This work identifies systematic deficiencies in foundational syntactic representation in contemporary large language models (LLMs), including o3-mini-high: the models fail to correctly apply phrase-structure rules, to process recursive hierarchical structures, to distinguish grammatical from semantic violations, and to evaluate multiple interpretations of compositional constructions such as Escher sentences. To probe these deficits rigorously, the authors design and deploy a suite of diagnostic tests: the Strawberry Test, Escher sentences, acceptability rating and explanation tasks, and multi-interpretation generation. Empirical evaluation reveals significant and consistent failures across all of the compositional linguistic tasks. The results suggest that deep learning models lack human-level recursive syntactic representation and compositional semantic reasoning; this compositionality bottleneck constitutes a fundamental limitation, directly challenging the prevailing claim that "large models will supplant formal linguistics."

📝 Abstract
A core component of a successful artificial general intelligence would be the rapid creation and manipulation of grounded compositional abstractions and the demonstration of expertise in the family of recursive hierarchical syntactic objects necessary for the creative use of human language. We evaluated the recently released o3 model (OpenAI; o3-mini-high) and discovered that while it succeeds on some basic linguistic tests relying on linear, surface statistics (e.g., the Strawberry Test), it fails to generalize basic phrase structure rules; it fails with comparative sentences involving semantically illegal cardinality comparisons ('Escher sentences'); it fails to correctly rate and explain acceptability dynamics; and it fails to distinguish between instructions to generate unacceptable semantic vs. unacceptable syntactic outputs. When tasked with generating simple violations of grammatical rules, it is seemingly incapable of representing multiple parses to evaluate against various possible semantic interpretations. In stark contrast to many recent claims that artificial language models are on the verge of replacing the field of linguistics, our results suggest not only that deep learning is hitting a wall with respect to compositionality (Marcus 2022), but that it is hitting [a [stubbornly [resilient wall]]] that cannot readily be surmounted to reach human-like compositional reasoning simply through more compute.
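The "Strawberry Test" mentioned in the abstract is the familiar probe of asking a model how many times a letter occurs in a word; because the reference answer is just a character count, any model response can be scored mechanically. A minimal sketch of such a scorer (the function names here are illustrative, not taken from the paper):

```python
def strawberry_test(word: str = "strawberry", letter: str = "r") -> int:
    """Ground truth for the probe: count occurrences of `letter` in `word`."""
    return word.lower().count(letter.lower())

def check_model_answer(model_answer: int,
                       word: str = "strawberry",
                       letter: str = "r") -> bool:
    """Score a model's claimed count against the mechanical ground truth."""
    return model_answer == strawberry_test(word, letter)

print(strawberry_test())      # 3 ("strawberry" contains three r's)
print(check_model_answer(2))  # False (a historically common wrong answer)
```

A probe like this tests only linear, surface-level string statistics, which is why, per the abstract, success on it says nothing about the recursive hierarchical structure tested by the other tasks.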
Problem

Research questions and friction points this paper is trying to address.

o3 model fails in linguistic generalization
struggles with syntactic and semantic rules
hits a wall in compositional reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated o3 model performance
Tested linguistic generalization capabilities
Identified compositionality limitations in AI
Elliot Murphy
Vivian L. Smith Department of Neurosurgery, UTHealth, Texas, USA; Texas Institute for Restorative Neurotechnologies, UTHealth, Texas, USA
Evelina Leivada
Research Professor at ICREA & Universitat Autònoma de Barcelona
Bilingualism, Language Variation, Language Acquisition, Morphosyntax
Vittoria Dentella
University of Pavia, Pavia, Italy
Fritz Günther
Department of Psychology, Humboldt-Universität zu Berlin
semantic memory, language models, conceptual combination, form-meaning mapping, vision models
Gary Marcus
New York University, New York, USA