🤖 AI Summary
To address the high cost, expert burden, and bias susceptibility of human evaluation in machine translation, this paper proposes ESA^AI, an AI-assisted Error Span Annotation protocol. Methodologically, it pre-fills error annotations with a recall-oriented automatic quality-estimation model that localizes likely error spans, and it filters out segments the model is highly confident are correct. The protocol also primes annotators with the pre-filled spans before they assign a final score, which counters potential automation bias. Experiments show that ESA^AI matches the annotation quality of conventional ESA (consistent final scores) while cutting the time per error-span annotation by more than half (from 71s to 31s) and reducing the total annotation budget by almost 25% via confidence-based filtering; automation bias is measured and confirmed to be low. Together, these results improve both the efficiency and the reliability of MT evaluation.
📝 Abstract
Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires a lot of expert human labor. In the recently adopted annotation protocol, Error Span Annotation (ESA), annotators mark erroneous parts of the translation and then assign a final score. A lot of the annotator time is spent on scanning the translation for possible errors. In our work, we help the annotators by pre-filling the error annotations with recall-oriented automatic quality estimation. With this AI assistance, we obtain annotations at the same quality level while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of the ESA$^{\mathrm{AI}}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score. This alleviates a potential automation bias, which we confirm to be low. In our experiments, we find that the annotation budget can be further reduced by almost 25% with filtering of examples that the AI deems likely to be correct.
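The two AI-assistance steps described above (pre-filling error spans from a quality-estimation model, and skipping segments the model is confident are correct) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Segment`, the `(p_correct, spans)` interface of `qe_model`, and the `skip_threshold` value are all assumed names and parameters.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    source: str
    translation: str
    # Pre-filled (start, end, severity) error spans from the QE model;
    # annotators then edit these spans and assign the final score.
    prefilled_spans: list = field(default_factory=list)

def prefill_and_filter(segments, qe_model, skip_threshold=0.95):
    """Illustrative ESA^AI-style preprocessing:
    - query a recall-oriented QE model for each segment,
    - skip segments the model deems very likely correct (budget saving),
    - pre-fill the remaining segments with the model's error spans."""
    to_annotate = []
    for seg in segments:
        # Hypothetical QE interface: probability the translation is
        # correct, plus a list of candidate error spans.
        p_correct, spans = qe_model(seg.source, seg.translation)
        if p_correct >= skip_threshold:
            continue  # high-confidence correct: no human pass needed
        seg.prefilled_spans = spans
        to_annotate.append(seg)
    return to_annotate
```

Because the QE model is recall-oriented, the pre-filled spans may over-flag; the human annotator's job shifts from scanning for errors to confirming or deleting candidate spans, which is where the reported time saving comes from.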