Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the distortion in multilingual large language model (LLM) evaluation caused by low-quality benchmark translations, which often suffer from semantic drift and loss of contextual nuance. To mitigate these issues, the authors propose a fully automated translation framework that combines a Universal Self-Improvement (USI) strategy, a novel multi-round ranking algorithm called T-RANK, and an LLM-as-a-judge evaluation mechanism. The framework produces high-fidelity translations that preserve task structure and linguistic subtleties. It is used to construct, for the first time, high-quality benchmark translations for eight underrepresented languages, including Ukrainian and Bulgarian, with significant improvements over existing resources on both automatic metrics and LLM-based evaluations. The improved benchmarks yield more accurate downstream model assessments, and all tools and datasets are publicly released.
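
The summary does not include pseudocode, so the following is a minimal sketch of how such a pipeline could be wired together, assuming a generic `call_llm(prompt) -> str` chat wrapper. The prompts, candidate counts, round counts, and the tournament reading of "multi-round ranking" are illustrative assumptions, not the authors' exact USI or T-RANK configurations.

```python
# Hypothetical sketch of the translate -> self-improve -> rank pipeline.
# `call_llm` is a placeholder for any chat-completion API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

def translate_candidates(text: str, target_lang: str, n: int = 4) -> list[str]:
    # Sample several independent draft translations of the same source.
    prompt = f"Translate into {target_lang}, preserving task structure:\n{text}"
    return [call_llm(prompt) for _ in range(n)]

def usi_refine(text: str, draft: str, target_lang: str, rounds: int = 2) -> str:
    # Self-improvement loop: the model critiques and rewrites its own output.
    for _ in range(rounds):
        draft = call_llm(
            f"Source:\n{text}\n\nDraft {target_lang} translation:\n{draft}\n\n"
            "Point out semantic drift or lost nuance, then output an improved translation."
        )
    return draft

def t_rank(text: str, candidates: list[str], rounds: int = 3) -> str:
    # Multi-round pairwise ranking with an LLM judge: winners of each round
    # are compared again, damping the noise of any single judgment.
    pool = list(candidates)
    for _ in range(rounds):
        if len(pool) == 1:
            break
        winners = []
        for a, b in zip(pool[::2], pool[1::2]):
            verdict = call_llm(
                f"Source:\n{text}\n\nTranslation A:\n{a}\n\nTranslation B:\n{b}\n\n"
                "Which translation is more faithful? Answer 'A' or 'B'."
            )
            winners.append(a if verdict.strip().upper().startswith("A") else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])  # odd candidate advances unopposed
        pool = winners
    return pool[0]

def translate(text: str, target_lang: str) -> str:
    drafts = translate_candidates(text, target_lang)
    refined = [usi_refine(text, d, target_lang) for d in drafts]
    return t_rank(text, refined)
```

The tournament structure in `t_rank` is one plausible interpretation of multi-round ranking: repeated pairwise comparisons trade extra test-time compute for a more stable final selection than a single judgment would give.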

📝 Abstract
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
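
As a concrete illustration of the two evaluation tracks the abstract mentions, the sketch below scores hypotheses with chrF via the real `sacrebleu` library (standing in for whichever reference-based metrics the authors used) and computes a win rate under a caller-supplied LLM judge; `judge` is a hypothetical callable, not an API from the paper.

```python
# Illustrative two-track evaluation harness (assumptions noted above).
from typing import Callable

from sacrebleu.metrics import CHRF

def chrf_score(hypotheses: list[str], references: list[str]) -> float:
    # Reference-based track: higher chrF means closer to the references.
    return CHRF().corpus_score(hypotheses, [references]).score

def judge_win_rate(
    ours: list[str],
    baseline: list[str],
    sources: list[str],
    judge: Callable[[str, str, str], str],
) -> float:
    # LLM-as-a-judge track: fraction of items where the judge prefers
    # our translation ("A") over the existing resource ("B").
    wins = 0
    for src, a, b in zip(sources, ours, baseline):
        wins += judge(src, a, b) == "A"
    return wins / len(sources)

# Example usage, given your own judge function:
#   rate = judge_win_rate(our_translations, existing, source_texts, my_judge)
```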
Problem

Research questions and friction points this paper is trying to address.

multilingual LLM evaluation
benchmark translation
semantic drift
context loss
translation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

automated translation
benchmark localization
Universal Self-Improvement (USI)
T-RANK
multilingual LLM evaluation