🤖 AI Summary
Existing machine-translated benchmark datasets commonly suffer from noise, structural deficiencies, and inconsistent quality, undermining the reliability of multilingual evaluation. This work proposes an automated quality assessment framework that integrates structured corpus auditing, neural quality metrics (COMET), and fine-grained error analysis powered by large language models (LLMs) to comprehensively diagnose and refine the EU20 benchmark. By evaluating major translation systems—DeepL, Google Translate, and ChatGPT—and analyzing both reference-based and reference-free COMET scores, the study finds that datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors (most notably HellaSwag), while ARC emerges as comparatively clean. The project releases a cleaned multilingual EU20 dataset, reproducible code, and a practical quality-prioritization guideline, offering a scalable template for constructing reliable multilingual benchmarks.
📝 Abstract
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence in evaluation results. What matters is not merely whether we can translate, but whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned and corrected versions of the EU20 datasets, along with code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.