🤖 AI Summary
It remains unclear whether the high benchmark scores of large language models (LLMs) reflect genuine metalinguistic reasoning or merely superficial pattern matching.
Method: We introduce Camlang, a constructed language exhibiting naturalistic yet unattested combinations of linguistic features, together with the Camlang-CSQA-v0 task, which simulates adult second-language acquisition via an explicit grammar specification and a bilingual dictionary to rigorously assess systematic mastery of a novel syntactic system.
Contribution/Results: This work establishes the first cognitively grounded evaluation paradigm for metalinguistic reasoning, enabling fine-grained error attribution across morpho-syntax, lexical semantics, and sentence-level inference. Experiments reveal that GPT-5 achieves 98% accuracy on English CSQA but only 47% on Camlang, substantially below human performance (87%). Successful predictions predominantly rely on shallow lexical alignment, exposing a critical deficit in rule internalisation. Our framework provides both a novel benchmark and a methodological advance for probing LLMs' compositional and systematic linguistic capabilities.
📝 Abstract
Large Language Models (LLMs) achieve gold-medal performance across many benchmarks, yet it remains unclear whether such success reflects genuine reasoning or pattern matching. From a cognitive science perspective, an informative test is whether models can master an unfamiliar language through explicit metalinguistic deductive learning, a paradigm in which human learners can reliably internalise grammatical systems through metalinguistic reasoning. We address this question with Camlang, a novel constructed language that exhibits naturalistic yet unattested feature combinations. Camlang consists of two explicit resources, a grammar book and a bilingual dictionary, which mirror adult second-language learning via explicit grammar rules and lexical lookup, and enable us to disentangle errors in morpho-syntax, lexical semantics, and sentence-level reasoning. Human experiments show that these resources are sufficient for participants to acquire Camlang and successfully solve Camlang tasks. To operationalise evaluation, we adapt CommonsenseQA into Camlang, creating Camlang-CSQA-v0, the first task in a broader suite in which solving questions requires applying grammar rules and lexical mappings. Experimental results show that GPT-5 achieves 98% EM accuracy in English but only 47% in Camlang, far below human performance at 87%, while other state-of-the-art reasoning LLMs perform even worse. Human verification further reveals that most model successes stem from shallow lexical alignment; GPT-5 shows emerging metalinguistic awareness to a limited extent, but not the systematic grammatical mastery that humans display. Camlang establishes a cognitively grounded evaluation paradigm that exposes fundamental gaps between current models and human metalinguistic competence.
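The EM (exact-match) accuracy reported above can be sketched as follows. This is a minimal illustration, assuming a simple case- and whitespace-insensitive normalisation; the paper's exact scoring protocol may differ.

```python
def em_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer
    after light normalisation (lowercasing, stripped whitespace)."""
    assert len(predictions) == len(references)

    def norm(s):
        return s.strip().lower()

    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical example: one of two answers matches exactly.
print(em_accuracy(["bank", "River"], ["bank", "ocean"]))  # 0.5
```

Under this metric, a model answer counts only if it reproduces the gold answer exactly, which is why shallow lexical alignment alone can still earn credit on some items.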