🤖 AI Summary
Existing benchmarks for Danish linguistic acceptability cover only a narrow range of error types and discriminate weakly between models. To address these limitations, the authors introduce DanLA, the first Danish benchmark grounded in authentic writing errors. It comprises 14 sentence corruption functions, each derived from errors observed in real-world Danish writing, that systematically turn correct sentences into grammatically ill-formed ones. The validity of the corruptions is verified through a hybrid of expert annotation and automated checks. Large language models (LLMs) are then evaluated on a linguistic acceptability judgment task. Mainstream LLMs perform markedly worse on DanLA than on existing benchmarks, indicating that it is both more challenging and better at separating strong models from weak ones. DanLA establishes the first systematic typology of Danish grammatical errors, filling a gap in fine-grained evaluation resources for Nordic languages, and grounds multilingual capability analysis in empirically observed linguistic phenomena.
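The paper's 14 corruption functions are not reproduced here, but a minimal sketch of the general approach might look like the following. The function name, the targeted error type (dropping the Danish present-tense "-r", the common "nutids-r" mistake), and the naive token heuristic are illustrative assumptions for this summary, not the paper's actual implementation:

```python
import random
import re

def corrupt_present_tense_r(sentence: str, rng: random.Random) -> str | None:
    """Drop the present-tense '-r' from one verb-like token.

    Mimics the common Danish 'nutids-r' error, where writers confuse
    the present tense ('hun laver') with the infinitive ('at lave').
    Token selection is a naive heuristic (words ending in '-er'),
    not real POS tagging.
    """
    tokens = sentence.split()
    # Candidate positions: tokens that look like present-tense verbs.
    candidates = [i for i, t in enumerate(tokens)
                  if re.fullmatch(r"\w{3,}er[.,!?]?", t)]
    if not candidates:
        return None  # nothing to corrupt; the caller should skip this sentence
    i = rng.choice(candidates)
    # Strip the final 'r', preserving any trailing punctuation.
    tokens[i] = re.sub(r"r([.,!?]?)$", r"\1", tokens[i])
    return " ".join(tokens)

if __name__ == "__main__":
    rng = random.Random(0)
    print(corrupt_present_tense_r("Hun laver mad hver aften.", rng))
    # -> "Hun lave mad hver aften."  (ungrammatical)
```

A function like this yields minimal pairs: a grammatical source sentence and a corrupted counterpart that differ only in the targeted error, which is what makes a fine-grained, per-error-type evaluation possible.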
📝 Abstract
We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The resulting data then serves as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability and a more difficult task, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, allowing it to better distinguish well-performing models from low-performing ones.
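The abstract leaves the evaluation protocol unspecified; a standard way to run an acceptability judgement task with LLMs is BLiMP-style minimal-pair scoring, where a pair counts as correct if the model assigns a higher log-likelihood to the grammatical sentence than to its corrupted counterpart. The sketch below assumes a causal language model scored with Hugging Face `transformers`; the model name (`gpt2` as a stand-in, where a Danish-capable model would be used in practice) and the helper names are illustrative assumptions, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; substitute a Danish-capable model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def sentence_log_likelihood(sentence: str) -> float:
    """Summed token log-probability of the sentence under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    # With labels equal to inputs, the model returns the mean cross-entropy
    # over the n-1 predicted tokens; un-average it to a summed log-prob.
    loss = model(input_ids=ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

def judge_pair(good: str, bad: str) -> bool:
    """True if the model prefers the grammatical sentence.

    Raw summed log-probs are comparable here because the two
    sentences in a minimal pair are nearly identical in length.
    """
    return sentence_log_likelihood(good) > sentence_log_likelihood(bad)

pairs = [("Hun laver mad hver aften.", "Hun lave mad hver aften.")]
accuracy = sum(judge_pair(g, b) for g, b in pairs) / len(pairs)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Accuracy aggregated per corruption function would then show which error types a given model handles well and which it fails on, which is where a broader set of corruption types buys the discriminatory power the abstract describes.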