NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of a standardized evaluation benchmark for Norwegian generative language models that covers both official written standards, Bokmål and Nynorsk. To this end, the authors introduce NorEval, the first comprehensive Norwegian-language evaluation benchmark, comprising 24 human-annotated datasets, five of them newly constructed, that span diverse understanding and generation tasks and fully support both written standards. The benchmark also establishes human baselines and provides over 100 manually crafted, semantically diverse prompt templates. All data, annotation guidelines, prompt templates, and evaluation code are publicly released and integrated into LM Evaluation Harness, enabling systematic benchmarking of 19 open-source Norwegian LMs. This establishes a foundation for fair, reproducible evaluation across both Norwegian written standards.

📝 Abstract
This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets, of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-written prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pre-trained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Norwegian language models comprehensively
Covering diverse language understanding and generation tasks
Supporting both Bokmål and Nynorsk written standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive Norwegian-language evaluation benchmark
24 human-annotated datasets paired with human baselines
Integration into LM Evaluation Harness for flexible, reproducible assessment