🤖 AI Summary
This work addresses the lack of multi-task evaluation benchmarks for Norwegian large language models (LLMs). We introduce a suite of four question-answering datasets (NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA) covering both written standards of Norwegian, Bokmål and Nynorsk. The suite spans world knowledge, commonsense reasoning, truthfulness, and Norway-specific knowledge, and comprises over 10,000 question-answer pairs created by native speakers. We evaluate 11 language models in zero-shot and few-shot settings. Most models perform better in Bokmål than in Nynorsk, struggle most with commonsense reasoning, and often generate untruthful answers. All datasets and annotation materials are publicly released to support the evaluation and further development of Norwegian language models.
📝 Abstract
This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both written standards of Norwegian (Bokmål and Nynorsk), our datasets comprise over 10k question-answer pairs created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than in Nynorsk, struggle most with commonsense reasoning, and are often untruthful when generating answers to questions. All our datasets and annotation materials are publicly available.