Danoliteracy of Generative, Large Language Models

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
The absence of reliable evaluation benchmarks for Danish large language models (LLMs) hinders rigorous assessment of their linguistic and cultural competence.
Method: We introduce Danoliteracy, the first generative evaluation benchmark for Danish, covering eight culturally grounded language-use scenarios (including citizenship exams and social media Q&A) designed via expert annotation, multi-task formulation, and correlation analysis.
Contribution/Results: We formally define "Danoliteracy" as a latent, domain-general proficiency factor (g-factor) that explains 95% of cross-task performance variance; even a compact version of the benchmark achieves high correlation with human judgments (ρ ≈ 0.8). Experiments show GPT-4 and Claude Opus achieve top performance, and the benchmark effectively discriminates model capability tiers. This work fills a critical gap in Nordic language evaluation and provides a reproducible methodological framework for assessing LLMs in low-resource languages.

📝 Abstract
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
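
The reported $\rho \sim 0.8$ is a rank correlation between the benchmark's model ranking and human feedback. As a minimal sketch of how such agreement can be checked (not the paper's code; the model names, benchmark scores, and human ratings below are made up for illustration):

```python
from scipy.stats import spearmanr

models = ["gpt-4", "claude-opus", "model-a", "model-b", "model-c"]
benchmark_scores = [0.87, 0.85, 0.62, 0.55, 0.40]  # hypothetical benchmark scores
human_ratings = [1250, 1240, 1080, 1020, 950]      # hypothetical human-feedback ratings

# Spearman's rho compares rankings rather than raw values, so it is robust
# to the different scales of benchmark scores and human preference ratings.
rho, p_value = spearmanr(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rho near 0.8 on real data would indicate the strong rank agreement the abstract reports.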
Problem

Research questions and friction points this paper is trying to address.

Evaluating Danish language competency in GLLMs
Lack of evaluation corpora for low-resource languages
Identifying consistency factors in language model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed GLLM benchmark for Danish evaluation
Assessed Danoliteracy across eight scenarios
Identified a strong underlying consistency factor in model performance (see the sketch below)
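
The "one strong underlying factor" finding corresponds to a single latent dimension dominating the model-by-scenario score matrix. Below is a minimal sketch of how one might probe this, assuming PCA on standardized scores as a stand-in for the paper's factor analysis; the synthetic score matrix is illustrative, not the paper's data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)

# Synthetic scores: 10 models x 8 scenarios, generated from one latent
# ability per model, per-scenario loadings, and a small noise term.
ability = rng.normal(size=(10, 1))
loadings = rng.uniform(0.8, 1.2, size=(1, 8))
scores = ability @ loadings + 0.1 * rng.normal(size=(10, 8))

# Standardize each scenario, then check how much variance PC1 captures.
pca = PCA().fit(StandardScaler().fit_transform(scores))
print(f"Variance explained by the first component: {pca.explained_variance_ratio_[0]:.1%}")
```

If the scenarios mostly measure one shared ability, the first component's explained-variance ratio approaches the ~95% figure reported in the abstract.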
Søren Vejlgaard Holm
Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark; Alvenir, Applebys Plads 7, 1411 København K, Denmark
Lars Kai Hansen
Professor, Cognitive Systems, DTU Compute, Technical University of Denmark
Machine learning, AI, neuroimaging, cognitive systems, signal processing
Martin Carsten Nielsen
Alvenir, Applebys Plads 7, 1411 København K, Denmark