Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the distortion in multilingual large language model (LLM) evaluation caused by low-quality benchmark translations, which often suffer from semantic drift and loss of contextual nuance. To mitigate these issues, the authors propose a fully automated translation framework that combines a Universal Self-Improvement (USI) strategy, a novel multi-round ranking algorithm called T-RANK, and an LLM-as-a-judge evaluation mechanism. The framework produces high-fidelity translations that preserve task structure and linguistic subtleties. It is used to construct, for the first time, high-quality benchmark translations for eight underrepresented languages, including Ukrainian and Bulgarian, with significant improvements over existing resources on both automatic metrics and LLM-based evaluations. The improved benchmarks yield more accurate downstream model assessments, and all tools and datasets are publicly released.
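
The summary does not include pseudocode, so the following is a minimal sketch of how such a pipeline could be wired together, assuming a generic `call_llm(prompt) -> str` chat wrapper. The prompts, candidate counts, round counts, and the tournament reading of "multi-round ranking" are illustrative assumptions, not the authors' exact USI or T-RANK configurations.

```python
# Hypothetical sketch of the translate -> self-improve -> rank pipeline.
# `call_llm` is a placeholder for any chat-completion API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

def translate_candidates(text: str, target_lang: str, n: int = 4) -> list[str]:
    # Sample several independent draft translations of the same source.
    prompt = f"Translate into {target_lang}, preserving task structure:\n{text}"
    return [call_llm(prompt) for _ in range(n)]

def usi_refine(text: str, draft: str, target_lang: str, rounds: int = 2) -> str:
    # Self-improvement loop: the model critiques and rewrites its own output.
    for _ in range(rounds):
        draft = call_llm(
            f"Source:\n{text}\n\nDraft {target_lang} translation:\n{draft}\n\n"
            "Point out semantic drift or lost nuance, then output an improved translation."
        )
    return draft

def t_rank(text: str, candidates: list[str], rounds: int = 3) -> str:
    # Multi-round pairwise ranking with an LLM judge: winners of each round
    # are compared again, damping the noise of any single judgment.
    pool = list(candidates)
    for _ in range(rounds):
        if len(pool) == 1:
            break
        winners = []
        for a, b in zip(pool[::2], pool[1::2]):
            verdict = call_llm(
                f"Source:\n{text}\n\nTranslation A:\n{a}\n\nTranslation B:\n{b}\n\n"
                "Which translation is more faithful? Answer 'A' or 'B'."
            )
            winners.append(a if verdict.strip().upper().startswith("A") else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate advances unopposed
        pool = winners
    return pool[0]

def translate(text: str, target_lang: str) -> str:
    drafts = translate_candidates(text, target_lang)
    refined = [usi_refine(text, d, target_lang) for d in drafts]
    return t_rank(text, refined)
```

The tournament structure in `t_rank` is one plausible interpretation of multi-round ranking: repeated pairwise comparisons trade extra test-time compute for a more stable final selection than a single judgment would give.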

📝 Abstract
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
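
As a concrete illustration of the two evaluation tracks the abstract mentions, the sketch below scores hypotheses with chrF via the real `sacrebleu` library (standing in for whichever reference-based metrics the authors used) and computes a win rate under a caller-supplied LLM judge; `judge` is a hypothetical callable, not an API from the paper.

```python
# Illustrative two-track evaluation harness (assumptions noted above).
from typing import Callable

from sacrebleu.metrics import CHRF

def chrf_score(hypotheses: list[str], references: list[str]) -> float:
    # Reference-based track: higher chrF means closer to the references.
    return CHRF().corpus_score(hypotheses, [references]).score

def judge_win_rate(
    ours: list[str],
    baseline: list[str],
    sources: list[str],
    judge: Callable[[str, str, str], str],
) -> float:
    # LLM-as-a-judge track: fraction of items where the judge prefers
    # our translation ("A") over the existing resource ("B").
    wins = 0
    for src, a, b in zip(sources, ours, baseline):
        wins += judge(src, a, b) == "A"
    return wins / len(sources)

# Example usage, given your own judge function:
#   rate = judge_win_rate(our_translations, existing, source_texts, my_judge)
```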
Problem

Research questions and friction points this paper is trying to address.

multilingual LLM evaluation
benchmark translation
semantic drift
context loss
translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated translation
benchmark localization
Universal Self-Improvement (USI)
T-RANK
multilingual LLM evaluation