🤖 AI Summary
Rigorous evaluation of the syntactic competence of large language models (LLMs), particularly their morphosyntactic judgment, is lacking for low-resource languages. Method: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs covering 101 languages and six linguistic phenomena, with more than 125,000 minimal pairs. Built on Universal Dependencies and UniMorph, MultiBLiMP 1.0 is constructed with a fully automated pipeline, which makes cross-lingual construction scalable. Contribution/Results: Evaluating LLMs at this scale highlights the shortcomings of current state-of-the-art models on low-resource languages. MultiBLiMP 1.0 thus provides a fine-grained grammatical assessment across more than one hundred languages.
📝 Abstract
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 6 linguistic phenomena, and containing more than 125,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates the abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state of the art in modelling low-resource languages.
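To illustrate what evaluation on minimal pairs can look like, here is a minimal sketch in Python. It assumes the common convention for such benchmarks: a model passes a pair when it assigns a higher score (for example, a log-probability) to the grammatical sentence than to the ungrammatical one. The abstract does not specify the scoring procedure, so this convention is an assumption. The function names `minimal_pair_accuracy` and `make_bigram_scorer`, the toy corpus, and the add-one-smoothed bigram model are all hypothetical stand-ins for a real LLM scorer and are not part of MultiBLiMP.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the
    grammatical sentence receives the strictly higher score."""
    pair_list = list(pairs)
    if not pair_list:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pair_list if score(good) > score(bad))
    return correct / len(pair_list)


def make_bigram_scorer(corpus: List[str]) -> Callable[[str], float]:
    """Toy stand-in for an LLM: an add-one-smoothed bigram model that
    returns the log-probability of a whitespace-tokenized sentence."""
    bigrams: Counter = Counter()
    contexts: Counter = Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def score(sentence: str) -> float:
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
            for prev, cur in zip(tokens, tokens[1:])
        )

    return score


# Hypothetical subject-verb agreement pairs: (grammatical, ungrammatical).
corpus = ["the dog barks", "the dogs bark", "the cat sleeps", "the cats sleep"]
pairs = [
    ("the dog barks", "the dog bark"),
    ("the cats sleep", "the cats sleeps"),
]
score = make_bigram_scorer(corpus)
accuracy = minimal_pair_accuracy(pairs, score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

With a real model, `score` would be replaced by a function returning the model's sentence log-probability. Because the two sentences in a pair differ minimally, a score difference can be attributed to the targeted grammatical phenomenon.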