Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

📅 2025-05-29

🤖 AI Summary

Existing word-level quality estimation (WQE) methods often rely on prompting large language models (LLMs) or on training with large amounts of human-labeled data, which makes them expensive. This paper investigates efficient unsupervised alternatives that identify translation errors from the inner workings of translation models, drawing on recent advances in language model interpretability and uncertainty quantification. The evaluation covers 14 metrics across 12 translation directions and uses multiple sets of human labels to quantify how human label variation affects metric performance. The results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

📝 Abstract

Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
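To make the idea of reading error signals from a translation model's own outputs concrete, here is a minimal sketch of two common uncertainty scores: the surprisal of each generated token and the entropy of the predictive distribution at each step. This is an illustration under assumptions, not the paper's method: the abstract does not say which signals are used, and `token_uncertainty`, `step_logits`, and `chosen_ids` are hypothetical names with toy values.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_uncertainty(step_logits, chosen_ids):
    """Return a (surprisal, entropy) pair for each generated token.

    step_logits: one list of vocabulary logits per decoding step (assumed input).
    chosen_ids:  index of the token actually generated at each step.
    """
    scores = []
    for logits, tok in zip(step_logits, chosen_ids):
        probs = softmax(logits)
        surprisal = -math.log(probs[tok])  # high when the chosen token was unlikely
        entropy = -sum(p * math.log(p) for p in probs if p > 0)  # high when the model is undecided
        scores.append((surprisal, entropy))
    return scores


# Toy example: 3 generated tokens over a 4-word vocabulary (made-up numbers).
step_logits = [
    [8.0, 0.0, 0.0, 0.0],  # model is confident and token 0 is generated
    [1.0, 1.0, 1.0, 1.0],  # model is maximally undecided
    [0.0, 5.0, 0.0, 0.0],  # model prefers token 1, but token 2 is generated
]
chosen_ids = [0, 2, 2]
scores = token_uncertainty(step_logits, chosen_ids)
```

In a setup like this, tokens with higher scores would be flagged as more likely to be errors, with no labeled data or LLM prompting involved. Subword scores would still need to be aggregated to word level, which this sketch leaves out.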

Problem

Research questions and friction points this paper is trying to address.

Unsupervised identification of machine translation errors

Reducing reliance on expensive human-labeled data

Assessing impact of human label variation on metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploiting language model interpretability for WQE

Using uncertainty quantification to identify errors

Evaluating metrics with multiple human labels
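The last point can be sketched as follows: the same word-level scores are thresholded once and then compared against each annotator's error labels separately, so the variation across annotators becomes visible. This is a hedged illustration only. F1 at a fixed threshold is an assumed choice of measure, and the function names, annotator names, scores, and labels are all hypothetical.

```python
def f1(pred, gold):
    """Binary F1 between predicted and gold error flags (one flag per word)."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_per_annotator(scores, annotations, threshold):
    """Threshold word-level scores, then evaluate against each annotator separately."""
    pred = [s >= threshold for s in scores]
    return {name: f1(pred, labels) for name, labels in annotations.items()}


# Toy data: error scores for 5 words and three annotators' labels (1 = error).
scores = [0.1, 0.9, 0.8, 0.2, 0.7]
annotations = {
    "annotator_a": [0, 1, 1, 0, 0],
    "annotator_b": [0, 1, 0, 0, 0],
    "annotator_c": [0, 0, 0, 1, 0],
}
results = f1_per_annotator(scores, annotations, threshold=0.5)

# The gap between the best and worst annotator shows how much a
# single-annotator evaluation could over- or under-state the metric.
spread = max(results.values()) - min(results.values())
```

Reporting the per-annotator results and their spread, rather than a single number from one label set, is one simple way to expose the brittleness that the abstract attributes to single-annotator evaluation.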
