🤖 AI Summary
Existing word-level quality estimation (WQE) methods typically rely on either large amounts of human-labeled training data or prompting of large language models (LLMs), which makes them expensive, and they are usually evaluated against a single set of human labels despite annotator disagreement and label uncertainty. This paper investigates efficient unsupervised alternatives that quantify word-level uncertainty using intrinsic signals from the translation model itself, such as attention distributions, softmax entropy, and gradient sensitivity, and evaluates them against multiple sets of human labels to measure the impact of label variation. These unsupervised metrics require neither human-annotated training data nor LLM queries. In an evaluation of 14 metrics across 12 translation directions, the results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
📝 Abstract
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
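As a rough illustration of what an uncertainty-based unsupervised metric can look like, the sketch below scores each generated token by its surprisal and the entropy of the model's output distribution, flags high-surprisal tokens as likely errors, and compares those flags against several annotators' labels separately. It is a minimal sketch under stated assumptions, not the paper's implementation: the toy logits, the threshold of 0.5, and the helper names `token_uncertainty`, `flag_tokens`, and `agreement_per_annotator` are all hypothetical, and plain token-level agreement stands in for whatever evaluation measures the paper actually uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_uncertainty(step_logits, chosen_ids):
    """Return one (surprisal, entropy) pair per generated token, in nats."""
    scores = []
    for logits, tok in zip(step_logits, chosen_ids):
        probs = softmax(logits)
        surprisal = -math.log(probs[tok])  # low probability -> high surprisal
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        scores.append((surprisal, entropy))
    return scores

def flag_tokens(scores, threshold):
    """Mark a token as a suspected error when its surprisal exceeds the threshold."""
    return [surprisal > threshold for surprisal, _ in scores]

def agreement_per_annotator(flags, annotator_labels):
    """Fraction of tokens where the flags match each annotator's labels."""
    return [
        sum(f == lab for f, lab in zip(flags, labels)) / len(flags)
        for labels in annotator_labels
    ]

# Toy example (assumed data): three decoding steps over a 3-word vocabulary.
step_logits = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 4.0, 0.0]]
chosen_ids = [0, 2, 1]

scores = token_uncertainty(step_logits, chosen_ids)
flags = flag_tokens(scores, threshold=0.5)  # threshold is an arbitrary assumption

# Two hypothetical annotators who disagree on the last token.
annotator_labels = [
    [False, True, False],
    [False, True, True],
]
agreement = agreement_per_annotator(flags, annotator_labels)
```

Scoring against each annotator separately, rather than against one merged label set, shows how much a metric's apparent quality depends on whose labels serve as the reference. In this toy case the same flags agree fully with the first annotator and only partially with the second.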