🤖 AI Summary
Current evaluations of large language models for low- and medium-resource languages such as Icelandic rely heavily on unverified synthetic or machine-translated data, which leads to unreliable assessment outcomes. This work presents a systematic quantitative error analysis of data quality in such evaluation benchmarks, comparing human-authored or human-translated data against automatically generated counterparts. The study finds that unverified data frequently contains severely flawed test examples that can distort model performance estimates and undermine benchmark validity. These findings underscore the importance of human verification when constructing trustworthy evaluation benchmarks and offer methodological guidance for robust assessment practices in low- and medium-resource language settings.
📝 Abstract
This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly for low- and medium-resource languages. We show that benchmarks containing synthetic or machine-translated data that has not been verified in any way commonly include severely flawed test examples that are likely to skew results and undermine the tests' validity. We warn against using such methods without verification in low- and medium-resource settings, as the quality of machine-translated test data can, at best, only be as good as the machine translation (MT) quality available for a given language at a given time. Indeed, the results of our quantitative error analysis of existing benchmarks for Icelandic show clear differences between human-authored or human-translated benchmarks and synthetic or machine-translated ones.