🤖 AI Summary
In human evaluation of high-quality machine translation (MT), genuine quality gains are easily obscured by annotation noise. To address this, we propose a two-stage extension of the state-of-the-art MQM evaluation paradigm, called MQM re-annotation: starting from a set of initial annotations, an annotator reviews and edits them, whether they came from the same annotator, a different human annotator, or an automatic MQM system. Rater behavior during re-annotation aligns with the intended editing process, and the second pass significantly improves annotation consistency (+18.3%) and error detection rate (+24.7%), largely by recovering errors missed in the first round. Experiments demonstrate that re-annotation substantially enhances the reliability and stability of evaluation outcomes, so that assessment scores more accurately reflect model quality improvements. The proposed approach offers a scalable, reproducible paradigm for high-precision MT evaluation.
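The exact consistency and error-detection metrics behind the numbers above are not spelled out in this summary. As a rough illustration only, span-level agreement between two sets of MQM error annotations is commonly measured with an exact-match F1 over error spans; the Python sketch below shows one such computation. All names here (`Span`, `span_f1`) are hypothetical and not taken from the paper.

```python
from typing import Set, Tuple

# One illustrative representation of an MQM error span:
# (start offset, end offset, severity label).
Span = Tuple[int, int, str]

def span_f1(rater_a: Set[Span], rater_b: Set[Span]) -> float:
    """Exact-match span F1 of rater A's errors against rater B's (a simple consistency proxy)."""
    if not rater_a and not rater_b:
        return 1.0  # both raters found no errors: treat as perfect agreement
    overlap = len(rater_a & rater_b)
    precision = overlap / len(rater_a) if rater_a else 0.0
    recall = overlap / len(rater_b) if rater_b else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: one shared major error, plus a minor error only rater B marked.
a = {(10, 18, "major")}
b = {(10, 18, "major"), (30, 34, "minor")}
print(round(span_f1(a, b), 2))  # 0.67
```

Comparing such agreement scores before and after the second annotation pass is one straightforward way to quantify a consistency gain of the kind reported above.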
📝 Abstract
Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations, which may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
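As a purely illustrative reading of this setup, a minimal sketch of the re-annotation data flow might look as follows. It assumes a simplified data model: names such as `MQMError`, `Annotation`, and `reannotation_diff` are hypothetical and not from the paper, and the severity weights (minor = 1, major = 5) follow common MQM practice rather than anything stated above.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Set, Tuple

@dataclass(frozen=True)
class MQMError:
    start: int                                # character offset of the error span
    end: int
    category: str                             # e.g. "accuracy/mistranslation"
    severity: Literal["minor", "major"]

@dataclass
class Annotation:
    segment_id: str
    errors: Tuple[MQMError, ...]
    source: Literal["self", "other", "auto"]  # origin of the first-pass annotations

def mqm_penalty(errors, minor_weight: float = 1.0, major_weight: float = 5.0) -> float:
    """Weighted MQM penalty (lower is better); weights follow common MQM convention."""
    return sum(major_weight if e.severity == "major" else minor_weight for e in errors)

def reannotation_diff(first_pass: Annotation, second_pass: Annotation) -> Dict[str, Set[MQMError]]:
    """Split error spans into those kept, deleted, or newly added in the second pass."""
    first, second = set(first_pass.errors), set(second_pass.errors)
    return {
        "kept": first & second,      # first-pass errors the re-annotator confirmed
        "deleted": first - second,   # first-pass errors judged spurious and removed
        "added": second - first,     # errors missed in the first pass and recovered
    }
```

In this sketch, the size of the `added` set relative to `kept` is one way to see how much of the quality gain comes from recovering errors missed in the first pass, which is the effect the abstract highlights.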