🤖 AI Summary
Existing automatic machine translation (MT) evaluation metrics typically score a single hypothesis against the source sentence in isolation, whereas human evaluators often judge a translation in the context of multiple alternatives; this mismatch in the evaluation setup can bias automatic assessment. To address it, we propose two multi-candidate-enhanced evaluation methods within the COMET framework: COMET-polycand, which compares and contrasts the hypothesis with alternative translations of the same source sentence, and COMET-polyic, which retrieves human-scored translations of semantically similar source texts to serve as in-context examples. Segment-level experiments show substantial improvements in correlation with human judgments: Kendall's tau-b rises from 0.079 for the COMET baseline to 0.118 with COMET-polycand and 0.116 with COMET-polyic, with further gains as more candidate translations are added. These results indicate that leveraging multi-candidate information brings automatic metrics into closer agreement with human evaluation.
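The retrieval step behind COMET-polyic, selecting the most similar scored examples to condition the evaluation on, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings, example pool, and function names are toy placeholders, and a real system would use sentence-encoder vectors with a learned scoring model on top of the retrieved context.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve_examples(query_vec, pool, k=2):
    """Return the k labeled (source, translation, score) examples whose
    source embeddings are most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

# Toy pool of previously human-scored translations (vectors are made up).
pool = [
    {"vec": [1.0, 0.1], "src": "Der Hund schläft.", "mt": "The dog sleeps.", "score": 0.9},
    {"vec": [0.0, 1.0], "src": "Es regnet stark.", "mt": "It rains hard.", "score": 0.7},
    {"vec": [0.9, 0.2], "src": "Die Katze schläft.", "mt": "The cat sleeps.", "score": 0.8},
]

# The retrieved examples would be fed to the metric alongside the new
# source/translation pair to guide its quality estimate.
context = retrieve_examples([1.0, 0.0], pool, k=2)
```

Here the two semantically closest examples (the "sleeps" sentences) are retrieved, together with their human scores, to contextualize the new judgment.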
📝 Abstract
Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes as input translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves segment-level metric performance (from 0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (from 0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
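All of the reported gains are measured with Kendall's tau-b, a rank-correlation coefficient that corrects for ties in either ranking. As a reference for the statistic itself (not the paper's evaluation code), a minimal pure-Python implementation using the O(n²) pairwise definition looks like:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C/D count
    concordant/discordant pairs and Tx/Ty count pairs tied only in x/only in y.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from every term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Identical rankings give 1.0, reversed rankings give -1.0, and ties shrink the denominator's correction terms; `scipy.stats.kendalltau` computes the same tau-b variant by default.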