Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

📅 2025-04-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing machine translation evaluation methods struggle to capture cross-sentence semantic coherence, particularly in long-document translation, where single-sentence intrinsic quality scores often overlook contextual consistency. To address this, the authors propose TREQA, an extrinsic evaluation framework that uses reading-comprehension question answering as a proxy task. TREQA employs large language models to automatically generate question-answer pairs from source (or reference) texts and then assesses whether candidate translations preserve, in their broader context, the key information needed to answer them. The method requires no human annotation and yields interpretable, semantically grounded judgments. Experiments show that TREQA performs on par with or better than state-of-the-art neural and LLM-based metrics, especially on long-range understanding tasks such as literary translation, and that its generated questions effectively pinpoint translation errors annotated by human experts.
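
The overall flow can be pictured with a short Python sketch. Everything below is illustrative: `ask_llm` is a placeholder for any chat-style LLM call, and the prompts and substring-matching rule are simplifications, not the released TREQA implementation.

```python
# Minimal sketch of a TREQA-style evaluation loop (illustrative, not the paper's code).

def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to any LLM and return its text response."""
    raise NotImplementedError

def generate_qa_pairs(source_paragraph: str, n: int = 5) -> list[tuple[str, str]]:
    """Ask the LLM for question-answer pairs targeting key information in the source."""
    prompt = (
        f"Write {n} reading-comprehension questions about the key information in the "
        f"paragraph below, each on its own line as 'Q: ... | A: ...'.\n\n{source_paragraph}"
    )
    pairs = []
    for line in ask_llm(prompt).splitlines():
        if "Q:" in line and "| A:" in line:
            question, answer = line.split("| A:", 1)
            pairs.append((question.replace("Q:", "", 1).strip(), answer.strip()))
    return pairs

def answer_from_translation(question: str, translation: str) -> str:
    """Answer the question using only the candidate translation as context."""
    prompt = f"Answer using only this text:\n{translation}\n\nQuestion: {question}"
    return ask_llm(prompt)

def treqa_style_score(source_paragraph: str, translation: str) -> float:
    """Fraction of generated questions whose gold answer is recovered from the translation."""
    pairs = generate_qa_pairs(source_paragraph)
    hits = sum(
        gold.lower() in answer_from_translation(question, translation).lower()
        for question, gold in pairs
    )
    return hits / max(len(pairs), 1)
```

Because the score is defined over answers rather than surface similarity to a reference, it rewards translations that keep key information recoverable in context, which is the "pragmatic" view of quality the paper argues for.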

📝 Abstract
Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more "pragmatic" approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa
Problem

Research questions and friction points this paper is trying to address.

Evaluating how well meaning is preserved in paragraph-level translation
Assessing whether key information is conveyed accurately in long, complex translations
Improving the interpretability of translation errors via QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reading-comprehension QA as an extrinsic proxy for translation quality
Matches or outperforms state-of-the-art neural and LLM-based metrics
Generated QA pairs make translation errors interpretable (see the sketch below)
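
To make the last point concrete, here is a hypothetical helper showing one way the generated QA pairs could localize errors: any question whose gold answer is not recoverable from the candidate translation points at a passage where meaning was lost. It builds on the pipeline sketch further up; `answer_fn` stands for any answer-extraction call (e.g. `answer_from_translation`), and the substring test is an illustrative stand-in for the paper's actual answer matching.

```python
# Illustrative only: flag QA pairs that the candidate translation fails,
# as a way to localize where meaning was lost.
def flag_mismatched_answers(qa_pairs, translation, answer_fn):
    """Return (question, expected, produced) triples the translation gets wrong."""
    flagged = []
    for question, expected in qa_pairs:
        produced = answer_fn(question, translation)
        if expected.lower() not in produced.lower():  # crude containment check
            flagged.append((question, expected, produced))
    return flagged
```

Each flagged triple gives a human-readable account of what the metric thinks went wrong, which is what the interpretability claim above refers to.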
Patrick Fernandes
Carnegie Mellon University & Instituto Superior Técnico
NLP, Machine Learning
Sweta Agrawal
Research Scientist at Google
Machine Translation, Natural Language Generation and Evaluation
Emmanouil Zaranis
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa
André F. T. Martins
Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Unbabel
Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing, Machine Learning, Artificial Intelligence