🤖 AI Summary
This study addresses the problem of inadequate translation quality in non-English commonsense reasoning evaluation, particularly for low-resource languages. To this end, we construct the first high-quality Estonian adaptation of the WinoGrande benchmark. Our method introduces a linguistically informed translation and adaptation framework that integrates Estonian linguistic features with the dataset's commonsense reasoning requirements, employing professional human translation alongside customized prompt engineering for machine translation. We systematically evaluate both the human-translated and machine-translated (via leading open- and closed-source LLMs) versions across multiple LLMs. Results show that models approach their performance on the original English test set when evaluated on the human-translated data, whereas machine translation incurs a substantial average drop of 12.3% that prompt engineering fails to mitigate. This confirms the indispensable role of expert linguists in ensuring the reliability of multilingual benchmarks. Our work establishes a reusable methodology and a practical standard for evaluating LLMs in low-resource languages.
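To make the described pipeline concrete, the sketch below shows one way such a prompt-engineered translation and evaluation loop might look in Python. This is a minimal illustration, not the authors' code: the model name (`gpt-4o`), the prompt wording, the example item, and the use of the OpenAI client are all assumptions for demonstration purposes.

```python
# Minimal sketch (not the authors' code): machine-translating one WinoGrande
# item into Estonian with a linguistically informed prompt, then scoring a
# model's binary choice on the translated item.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# WinoGrande items are fill-in-the-blank sentences with two candidate answers.
# This item is illustrative, not taken from the Estonian dataset.
item = {
    "sentence": "The trophy didn't fit in the suitcase because _ was too big.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": "1",
}

# Hypothetical translation prompt encoding the kind of guidance the paper
# describes: e.g., Estonian lacks grammatical gender, so referential ambiguity
# must be preserved through word choice rather than he/she distinctions.
translate_prompt = (
    "Translate this WinoGrande item into natural Estonian. Keep exactly one "
    "'_' placeholder, keep both options plausible referents until the blank "
    "is filled, and adapt names or culture-specific terms where needed.\n"
    f"Sentence: {item['sentence']}\n"
    f"Option 1: {item['option1']}\nOption 2: {item['option2']}"
)

translation = client.chat.completions.create(
    model="gpt-4o",  # assumption; the paper tests several open and closed models
    messages=[{"role": "user", "content": translate_prompt}],
).choices[0].message.content

# Evaluation step: ask a model to choose option 1 or 2 and compare to gold.
eval_prompt = (
    f"{translation}\n\nWhich option fills the blank? Answer with '1' or '2' only."
)
pred = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": eval_prompt}],
).choices[0].message.content.strip()

print("correct" if pred == item["answer"] else "incorrect")
```

In practice, accuracy would be averaged over the full test set, and the same evaluation step would be run on the human-translated items for comparison.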
📄 Abstract
In this paper, we present a localized and culturally adapted Estonian translation of the test set of the widely used commonsense reasoning benchmark WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt, tailored to address both the linguistic characteristics of Estonian and the specific translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Our experiments also indicate that prompt engineering offers limited improvement in translation quality or downstream model accuracy. These results highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of the language competency and reasoning abilities of large language models.