GaelEval: Benchmarking LLM Performance for Scottish Gaelic

📅 2026-04-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) on morphosyntactically rich low-resource languages such as Scottish Gaelic, where conventional translation benchmarks inadequately capture structural linguistic competence. The authors propose GaelEval, the first multidimensional evaluation benchmark for Gaelic, comprising expert-crafted morphosyntactic multiple-choice questions, culturally contextualized translation tasks, and large-scale cultural knowledge question answering. They establish a human baseline using 30 fluent speakers and evaluate 19 prominent LLMs. Results show that Gemini 3 Pro Preview achieves 83.3% accuracy on grammatical tasks, surpassing the human baseline (78.1%). Proprietary models generally outperform open-weight counterparts, while Gaelic-language prompting improves performance on linguistic tasks but degrades it on cultural ones. This work establishes the first comprehensive evaluation framework for Gaelic, revealing both the current limitations and untapped potential of LLMs in low-resource linguistic settings.
๐Ÿ“ Abstract
Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting, and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting, and shows a consistent performance gap favouring proprietary over open-weight models.
Problem

Research questions and friction points this paper is trying to address.

Scottish Gaelic
large language models
morphosyntax
cultural knowledge
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

GaelEval
Scottish Gaelic
multilingual LLMs
morphosyntactic evaluation
cultural knowledge benchmark