🤖 AI Summary
The absence of reliable evaluation benchmarks for Danish large language models (LLMs) hinders rigorous assessment of their linguistic and cultural competence.
Method: We introduce the Danoliteracy benchmark, the first generative evaluation benchmark for Danish, covering eight culturally grounded language-use scenarios (including citizenship exams and social media Q&A) designed via expert annotation, multi-task formulation, and correlation analysis.
Contribution/Results: Analyzing performance across scenarios reveals a single latent, domain-general proficiency factor (a g-factor) that explains 95% of cross-task variance; even a compact version of the benchmark achieves high correlation with human judgments (ρ ≈ 0.8). Experiments show that GPT-4 and Claude Opus achieve top performance and that the benchmark effectively discriminates model capability tiers. This work fills a critical gap in Nordic language evaluation and provides a reproducible methodological framework for assessing LLMs in low-resource languages.
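The reported agreement with human judgments is a rank correlation, so it can be sanity-checked with Spearman's ρ over matched model rankings. Below is a minimal sketch using scipy; the score arrays are purely illustrative stand-ins, not the paper's data:

```python
from scipy.stats import spearmanr

# Hypothetical aggregate scores for the same six models under
# the benchmark and under human feedback (illustrative values only).
benchmark_scores = [0.81, 0.78, 0.62, 0.55, 0.41, 0.30]
human_scores = [0.85, 0.74, 0.60, 0.58, 0.35, 0.33]

# Spearman's rho compares rank orders, so it tolerates different
# score scales between the two evaluation sources.
rho, p_value = spearmanr(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```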
📝 Abstract
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate *Danoliteracy*, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
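A factor explaining 95% of cross-scenario variance can be estimated from the model-by-scenario score matrix. The paper's exact factor-analysis procedure is not given here, so the sketch below substitutes a plain PCA on synthetic scores driven by one latent per-model skill; all numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical score matrix: 12 models (rows) x 8 scenarios (columns),
# generated from a single latent skill per model plus small noise.
rng = np.random.default_rng(0)
ability = rng.uniform(0.2, 0.9, size=(12, 1))          # latent per-model skill
scores = ability + rng.normal(0.0, 0.03, size=(12, 8))

# Standardize each scenario column, then check how much variance the
# first principal component (the candidate g-factor) accounts for.
scores_std = (scores - scores.mean(axis=0)) / scores.std(axis=0)
pca = PCA().fit(scores_std)
print(f"First component explains {pca.explained_variance_ratio_[0]:.0%} of variance")
```

When scenario columns are strongly correlated, the first component dominates the spectrum, mirroring the single-factor structure reported for GLLM performance in Danish.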