Danoliteracy of Generative, Large Language Models

📅 2024-10-30
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
The absence of reliable evaluation benchmarks for Danish large language models (LLMs) hinders rigorous assessment of their linguistic and cultural competence.
Method: We introduce Danoliteracy, the first generative evaluation benchmark for Danish, covering eight culturally grounded language-use scenarios (including citizenship exams and social media Q&A) designed via expert annotation, multi-task formulation, and correlation analysis.
Contribution/Results: We formally define "Danoliteracy" as a latent, domain-general proficiency factor (g-factor) that explains 95% of cross-task performance variance; even a compact version of the benchmark achieves high correlation with human judgments (ρ ≈ 0.8). Experiments show GPT-4 and Claude Opus achieve top performance, and the benchmark effectively discriminates model capability tiers. This work fills a critical gap in Nordic language evaluation and provides a reproducible methodological framework for assessing LLMs in low-resource languages.

📝 Abstract
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: These models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
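
The reported $\rho \sim 0.8$ is a rank correlation between the benchmark's model ranking and human feedback. As a minimal sketch of how such agreement can be checked (not the paper's code; the model names, benchmark scores, and human ratings below are made up for illustration):

```python
from scipy.stats import spearmanr

models = ["gpt-4", "claude-opus", "model-a", "model-b", "model-c"]
benchmark_scores = [0.87, 0.85, 0.62, 0.55, 0.40]  # hypothetical benchmark scores
human_ratings = [1250, 1240, 1080, 1020, 950]      # hypothetical human-feedback ratings

# Spearman's rho compares rankings rather than raw values, so it is robust
# to the different scales of benchmark scores and human preference ratings.
rho, p_value = spearmanr(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rho near 0.8 on real data would indicate the strong rank agreement the abstract reports.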
Problem

Research questions and friction points this paper is trying to address.

Evaluating Danish language competency in GLLMs
Lack of evaluation corpora for low-resource languages
Identifying consistency factors in language model adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed GLLM benchmark for Danish evaluation
Assessed Danoliteracy across eight scenarios
Identified a strong underlying consistency factor in model performance (see the sketch below)
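
The "one strong underlying factor" finding corresponds to a single latent dimension dominating the model-by-scenario score matrix. Below is a minimal sketch of how one might probe this, assuming PCA on standardized scores as a stand-in for the paper's factor analysis; the synthetic score matrix is illustrative, not the paper's data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(seed=0)

# Synthetic scores: 10 models x 8 scenarios, generated from one latent
# ability per model, per-scenario loadings, and a small noise term.
ability = rng.normal(size=(10, 1))
loadings = rng.uniform(0.8, 1.2, size=(1, 8))
scores = ability @ loadings + 0.1 * rng.normal(size=(10, 8))

# Standardize each scenario, then check how much variance PC1 captures.
pca = PCA().fit(StandardScaler().fit_transform(scores))
print(f"Variance explained by the first component: {pca.explained_variance_ratio_[0]:.1%}")
```

If the scenarios mostly measure one shared ability, the first component's explained-variance ratio approaches the ~95% figure reported in the abstract.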
Søren Vejlgaard Holm
Technical University of Denmark, Anker Engelunds Vej 1, 2800 Kongens Lyngby, Denmark; Alvenir, Applebys Plads 7, 1411 København K, Denmark
Lars Kai Hansen
Professor, Cognitive Systems, DTU Compute, Technical University of Denmark
Machine learning, AI, neuroimaging, cognitive systems, signal processing
Martin Carsten Nielsen
Alvenir, Applebys Plads 7, 1411 København K, Denmark