COMET-poly: Machine Translation Metric Grounded in Other Candidates

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic machine translation (MT) evaluation metrics predominantly score a single hypothesis against the source sentence, whereas human evaluators typically assess a translation in the context of multiple alternative candidates. This mismatch in the evaluation setup can bias automatic assessment. To address it, the paper proposes two multi-candidate evaluation methods within the COMET framework: COMET-polycand, which conditions the score on alternative translations of the same source sentence so the hypothesis can be compared and contrasted with them; and COMET-polyic, which, inspired by retrieval-based in-context learning, retrieves translations of similar source texts together with their human-labeled quality scores to guide the evaluation. Segment-level experiments show substantial improvements in correlation with human judgments: Kendall's tau-b reaches 0.118 for COMET-polycand and 0.116 for COMET-polyic, surpassing the baseline COMET (0.079). These results support the claim that leveraging multi-candidate information improves agreement between automatic metrics and human evaluation.

📝 Abstract
Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
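The reported numbers (0.079 → 0.118 and 0.116) are segment-level Kendall's tau-b correlations between metric scores and human judgments. A minimal sketch of how such a correlation is computed, using made-up scores rather than the paper's data:

```python
# Kendall's tau-b between hypothetical human labels and metric outputs.
# tau-b corrects the plain tau for ties in either ranking.
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation over paired score lists."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0:
            tied_x += 1          # pair tied in x
        if dy == 0:
            tied_y += 1          # pair tied in y
        if dx * dy > 0:
            concordant += 1      # same order in both rankings
        elif dx * dy < 0:
            discordant += 1      # opposite order
    n0 = n * (n - 1) // 2        # total number of pairs
    return (concordant - discordant) / sqrt((n0 - tied_x) * (n0 - tied_y))

# Illustrative scores only, not the paper's data.
human_scores  = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5]
metric_scores = [0.85, 0.5, 0.6, 0.1, 0.9, 0.45]
print(kendall_tau_b(human_scores, metric_scores))  # → 0.7333...
```

A higher tau-b means the metric ranks segment pairs more consistently with humans; the paper's gains (0.079 to 0.118) are on this scale.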
Problem

Research questions and friction points this paper is trying to address.

Automated metrics ignore the multiple translation alternatives that human evaluators consider
Current metrics use only the source and a single translation, limiting accuracy
New metrics are proposed that incorporate additional translations for a better-informed assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

COMET-polycand compares the hypothesis against alternative translations of the same source
COMET-polyic incorporates retrieved similar examples along with their human quality scores
Both metrics improve segment-level correlation with human judgments over baseline COMET
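The retrieval step behind a COMET-polyic-style metric can be sketched as nearest-neighbour lookup over embedded source texts. Everything below, including the toy embeddings, example pool, and choice of k, is a hypothetical stand-in for illustration, not the paper's implementation:

```python
# Sketch: retrieve the k labeled examples whose source embeddings are most
# cosine-similar to the query source; their scores would then be fed to the
# metric as in-context guidance. Embeddings here are toy 3-dim vectors.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve_examples(query_emb, pool, k=2):
    """pool: list of (embedding, human_score) for labeled translations.
    Returns the k entries most similar to the query source embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_emb, ex[0]), reverse=True)
    return ranked[:k]

# Hypothetical labeled pool: (source embedding, human quality score).
pool = [
    ([1.0, 0.0, 0.0], 0.8),
    ([0.9, 0.1, 0.0], 0.75),
    ([0.0, 1.0, 0.0], 0.3),
    ([0.0, 0.0, 1.0], 0.5),
]
print(retrieve_examples([1.0, 0.05, 0.0], pool, k=2))
```

In the actual metric the retrieved translations and their scores condition the quality prediction; this sketch only covers the lookup.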