🤖 AI Summary
This study addresses the problem of inadequate translation quality in non-English commonsense reasoning evaluation, particularly for low-resource languages. To this end, we construct the first high-quality Estonian adaptation of the WinoGrande benchmark. Our method introduces a linguistically informed translation and adaptation framework that integrates Estonian linguistic features with the dataset's commonsense reasoning requirements, employing professional human translation alongside customized prompt engineering for machine translation. We systematically evaluate both the human-translated and machine-translated (via leading open- and closed-source LLMs) versions across multiple LLMs. Results show that models approach their performance on the original English test set when evaluated on the human-translated data, whereas machine translation incurs a substantial average drop of 12.3% that prompt engineering fails to mitigate. This confirms the indispensable role of expert linguists in ensuring the reliability of multilingual benchmarks. Our work establishes a reusable methodology and a practical standard for evaluating LLMs in low-resource languages.
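To make the described pipeline concrete, the sketch below shows one way such a prompt-engineered translation and evaluation loop might look in Python. This is a minimal illustration, not the authors' code: the model name (`gpt-4o`), the prompt wording, the example item, and the use of the OpenAI client are all assumptions for demonstration purposes.

```python
# Minimal sketch (not the authors' code): machine-translating one WinoGrande
# item into Estonian with a linguistically informed prompt, then scoring a
# model's binary choice on the translated item.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# WinoGrande items are fill-in-the-blank sentences with two candidate answers.
# This item is illustrative, not taken from the Estonian dataset.
item = {
    "sentence": "The trophy didn't fit in the suitcase because _ was too big.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": "1",
}

# Hypothetical translation prompt encoding the kind of guidance the paper
# describes: e.g., Estonian lacks grammatical gender, so referential ambiguity
# must be preserved through word choice rather than he/she distinctions.
translate_prompt = (
    "Translate this WinoGrande item into natural Estonian. Keep exactly one "
    "'_' placeholder, keep both options plausible referents until the blank "
    "is filled, and adapt names or culture-specific terms where needed.\n"
    f"Sentence: {item['sentence']}\n"
    f"Option 1: {item['option1']}\nOption 2: {item['option2']}"
)

translation = client.chat.completions.create(
    model="gpt-4o",  # assumption; the paper tests several open and closed models
    messages=[{"role": "user", "content": translate_prompt}],
).choices[0].message.content

# Evaluation step: ask a model to choose option 1 or 2 and compare to gold.
eval_prompt = (
    f"{translation}\n\nWhich option fills the blank? Answer with '1' or '2' only."
)
pred = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": eval_prompt}],
).choices[0].message.content.strip()

print("correct" if pred == item["answer"] else "incorrect")
```

In practice, accuracy would be averaged over the full test set, and the same evaluation step would be run on the human-translated items for comparison.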
📄 Abstract
In this paper, we present a localized and culturally adapted Estonian translation of the test set of the widely used commonsense reasoning benchmark WinoGrande. We detail the translation and adaptation process carried out by translation specialists and evaluate the performance of both proprietary and open-source models on the human-translated benchmark. Additionally, we explore the feasibility of achieving high-quality machine translation by incorporating insights from the manual translation process into the design of a detailed prompt, tailored to address both the linguistic characteristics of Estonian and the specific translation challenges posed by the WinoGrande dataset. Our findings show that model performance on the human-translated Estonian dataset is slightly lower than on the original English test set, while performance on machine-translated data is notably worse. Our experiments also indicate that prompt engineering offers limited improvement in translation quality or downstream model accuracy. These results highlight the importance of involving language specialists in dataset translation and adaptation to ensure reliable and interpretable evaluations of the language competency and reasoning abilities of large language models.