🤖 AI Summary
This study addresses the lack of systematic evaluation of cross-linguistic metalinguistic knowledge (explicit reasoning about language structure) in large language models, particularly for low-resource and globally diverse languages. The authors introduce and release the first multilingual benchmark for metalinguistic knowledge, derived from the World Atlas of Language Structures (WALS) and spanning lexical, syntactic, and phonological dimensions, evaluated using accuracy and macro-F1 scores alongside majority-class and random baselines. Performance is further analyzed in relation to language resource availability, such as Wikipedia size. Experiments reveal that GPT-4o achieves the highest accuracy (0.367) among the tested models, yet no model surpasses the majority-class baseline. Critically, model performance correlates with the digital resource availability of each language, exposing a fundamental limitation: current models possess fragmented metalinguistic knowledge that depends heavily on data visibility rather than robust structural understanding.
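To make the benchmark-construction step concrete, here is a minimal sketch of how a WALS feature might be rendered as a natural-language multiple-choice question, as the summary describes. The function name `wals_feature_to_mcq`, its arguments, and the prompt wording are illustrative assumptions; the paper's actual templates and WALS export format may differ.

```python
# Hypothetical sketch: render one (language, feature) pair from WALS as a
# multiple-choice question. Field names and phrasing are assumptions, not
# the paper's actual prompt templates.

def wals_feature_to_mcq(language: str, feature_name: str, values: list[str]) -> str:
    """Build a natural-language MCQ for a single WALS feature."""
    # Label each documented feature value as an answer option (A, B, C, ...).
    options = "\n".join(
        f"{chr(ord('A') + i)}. {value}" for i, value in enumerate(values)
    )
    return (
        f"What is the value of the feature '{feature_name}' in {language}?\n"
        f"{options}\n"
        f"Answer with a single letter."
    )

# Example with a real WALS feature (Feature 81A: Order of Subject, Object and Verb).
print(wals_feature_to_mcq(
    language="Swahili",
    feature_name="Order of Subject, Object and Verb",
    values=["SOV", "SVO", "VSO", "VOS", "OVS", "OSV", "No dominant order"],
))
```

Because every WALS feature comes with a closed set of attested values, this closed-form MCQ framing makes chance and majority-class baselines straightforward to define per feature.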
📝 Abstract
LLMs are routinely evaluated on language use, yet their explicit knowledge about linguistic structure remains poorly understood. Existing linguistic benchmarks focus on narrow phenomena, emphasize high-resource languages, and rarely test metalinguistic knowledge, that is, explicit reasoning about language structure. We present a multilingual evaluation of metalinguistic knowledge in LLMs based on the World Atlas of Language Structures (WALS), which documents 192 linguistic features across 2,660 languages. We convert WALS features into natural-language multiple-choice questions and evaluate models across the documented languages. Using accuracy and macro-F1, and comparing against chance and majority-class baselines, we assess performance and analyse variation across linguistic domains and language-related factors. Results show limited metalinguistic knowledge: GPT-4o performs best but reaches only moderate accuracy (0.367), while open-source models lag behind. Although all models perform above chance, none outperforms the majority-class baseline, suggesting that they capture broad cross-linguistic patterns but lack fine-grained distinctions. Performance varies by domain, partly reflecting differences in online visibility. At the language level, accuracy correlates with digital language status: models answer more accurately for languages with greater digital presence and resources, while low-resource languages fare worse. An analysis of predictive factors confirms that resource-related indicators (Wikipedia size, corpus availability) are more informative than geographic, genealogical, or sociolinguistic factors. Overall, LLM metalinguistic knowledge appears fragmented and shaped mainly by data availability rather than broadly generalizable grammatical competence. We release the benchmark as an open-source dataset to support evaluation across languages and to encourage greater global linguistic diversity in future LLMs.
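The evaluation protocol the abstract outlines (accuracy and macro-F1 against chance and majority-class baselines, plus a language-level correlation with a resource proxy such as Wikipedia size) can be sketched in a few lines. The snippet below is a hedged illustration with toy data, not the paper's pipeline; variable names and the choice of Spearman rank correlation are assumptions.

```python
# Illustrative sketch of the reported evaluation setup, using toy data.
import random
from collections import Counter

from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, f1_score

gold = ["SVO", "SOV", "SVO", "VSO", "SVO", "SOV"]          # toy gold labels
model_preds = ["SVO", "SVO", "SVO", "SOV", "SVO", "SOV"]   # toy model output

# Baselines: uniform random choice over the option set, and always-majority.
options = sorted(set(gold))
random.seed(0)
random_preds = [random.choice(options) for _ in gold]
majority_label = Counter(gold).most_common(1)[0][0]
majority_preds = [majority_label] * len(gold)

for name, preds in [("model", model_preds),
                    ("random", random_preds),
                    ("majority", majority_preds)]:
    acc = accuracy_score(gold, preds)
    macro_f1 = f1_score(gold, preds, average="macro")
    print(f"{name:>8}: accuracy={acc:.3f}  macro-F1={macro_f1:.3f}")

# Language-level analysis: does per-language accuracy track a digital
# resource proxy? Both lists below are invented for illustration.
per_language_accuracy = [0.55, 0.41, 0.33, 0.28]
wikipedia_articles = [6_700_000, 550_000, 60_000, 4_000]
rho, p = spearmanr(per_language_accuracy, wikipedia_articles)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
```

Macro-F1 complements accuracy here because WALS feature values are often imbalanced: a model that always predicts the majority value can match majority-baseline accuracy while its macro-F1 collapses, which is exactly the failure mode the abstract's baseline comparison is designed to expose.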