🤖 AI Summary
This study investigates how integrating large language models (LLMs) into machine translation pipelines affects two quality prediction paradigms: source-side difficulty prediction and candidate translation quality estimation. Leveraging a human post-editing dataset comprising over 6,000 English sentences, each with nine translation hypotheses—drawn from both conventional neural machine translation systems and state-of-the-art LLMs—the work presents the first systematic evaluation of how LLMs and traditional MT approaches influence the reliability of quality prediction in realistic multi-candidate scenarios. Using Kendall's rank correlation, the predictive power of source-side features, candidate-side quality models, and positional heuristics is assessed against TER and COMET as gold-standard metrics. The results demonstrate that LLMs substantially alter the behavior of traditional prediction methods while alleviating longstanding challenges in document-level translation quality estimation.
📝 Abstract
This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) in MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models, and positional heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
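The evaluation metric underlying both paradigms is Kendall's rank correlation between predicted scores and gold scores. A minimal tau-a sketch (the scores below are invented for illustration, not taken from the paper's data):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a rank correlation between two equal-length score lists.

    Counts concordant and discordant pairs; ties contribute to neither.
    """
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)

# Hypothetical QE scores for four candidate translations (higher = better)
# versus their gold TER scores (lower = less post-editing effort).
qe_scores = [0.82, 0.74, 0.91, 0.60]
ter_scores = [0.15, 0.30, 0.10, 0.45]
print(kendall_tau(qe_scores, ter_scores))  # → -1.0 (QE ranking inverts TER)
```

A tau near -1 against TER (or +1 against COMET) means the predictor orders candidates almost exactly as the gold metric does; values near 0 indicate no useful ranking signal.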