🤖 AI Summary
To address the high cost, expert burden, and bias susceptibility of human evaluation in machine translation, this paper proposes ESA^AI, an AI-assisted Error Span Annotation protocol. Methodologically, it pre-fills error annotations with a recall-oriented automatic quality-estimation model that localizes likely error spans, and it filters out segments the model is highly confident are correct. The protocol also primes annotators with the pre-filled spans before they assign a final score, which counters potential automation bias. Experiments show that ESA^AI matches the annotation quality of conventional ESA (consistent final scores) while cutting the time per error-span annotation by more than half (from 71s to 31s) and reducing the total annotation budget by almost 25% via confidence-based filtering; automation bias is measured and confirmed to be low. Together, these results improve both the efficiency and the reliability of MT evaluation.
📝 Abstract
Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. In the recently adopted annotation protocol, Error Span Annotation (ESA), annotators mark erroneous parts of the translation and then assign a final score. A lot of the annotator time is spent on scanning the translation for possible errors. In our work, we help the annotators by pre-filling the error annotations with recall-oriented automatic quality estimation. With this AI assistance, we obtain annotations at the same quality level while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^{\mathrm{AI}}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score. This alleviates a potential automation bias, which we confirm to be low. In our experiments, we find that the annotation budget can be further reduced by almost 25% with filtering of examples that the AI deems likely to be correct.
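The two AI-assistance steps described above (pre-filling error spans from a quality-estimation model, and skipping segments the model is confident are correct) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Segment`, the `(p_correct, spans)` interface of `qe_model`, and the `skip_threshold` value are all assumed names and parameters.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    source: str
    translation: str
    # Pre-filled (start, end, severity) error spans from the QE model;
    # annotators then edit these spans and assign the final score.
    prefilled_spans: list = field(default_factory=list)

def prefill_and_filter(segments, qe_model, skip_threshold=0.95):
    """Illustrative ESA^AI-style preprocessing:
    - query a recall-oriented QE model for each segment,
    - skip segments the model deems very likely correct (budget saving),
    - pre-fill the remaining segments with the model's error spans."""
    to_annotate = []
    for seg in segments:
        # Hypothetical QE interface: probability the translation is
        # correct, plus a list of candidate error spans.
        p_correct, spans = qe_model(seg.source, seg.translation)
        if p_correct >= skip_threshold:
            continue  # high-confidence correct: no human pass needed
        seg.prefilled_spans = spans
        to_annotate.append(seg)
    return to_annotate
```

Because the QE model is recall-oriented, the pre-filled spans may over-flag; the human annotator's job shifts from scanning for errors to confirming or deleting candidate spans, which is where the reported time saving comes from.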