🤖 AI Summary
Existing automatic machine translation (MT) evaluation metrics typically score a single hypothesis against the source sentence in isolation, whereas human evaluators often judge a translation in the context of multiple alternatives; this mismatch in the evaluation setup can bias automatic assessment. To address it, we propose two multi-candidate-enhanced evaluation methods within the COMET framework: COMET-polycand, which compares and contrasts the hypothesis with alternative translations of the same source sentence, and COMET-polyic, which retrieves human-scored translations of semantically similar source texts to serve as in-context examples. Segment-level experiments show substantial improvements in correlation with human judgments: Kendall's tau-b rises from 0.079 for the COMET baseline to 0.118 with COMET-polycand and 0.116 with COMET-polyic, with further gains as more candidate translations are added. These results indicate that leveraging multi-candidate information brings automatic metrics into closer agreement with human evaluation.
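The retrieval step behind COMET-polyic, selecting the most similar scored examples to condition the evaluation on, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings, example pool, and function names are toy placeholders, and a real system would use sentence-encoder vectors with a learned scoring model on top of the retrieved context.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def retrieve_examples(query_vec, pool, k=2):
    """Return the k labeled (source, translation, score) examples whose
    source embeddings are most similar to the query embedding."""
    ranked = sorted(pool, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True)
    return ranked[:k]

# Toy pool of previously human-scored translations (vectors are made up).
pool = [
    {"vec": [1.0, 0.1], "src": "Der Hund schläft.", "mt": "The dog sleeps.", "score": 0.9},
    {"vec": [0.0, 1.0], "src": "Es regnet stark.", "mt": "It rains hard.", "score": 0.7},
    {"vec": [0.9, 0.2], "src": "Die Katze schläft.", "mt": "The cat sleeps.", "score": 0.8},
]

# The retrieved examples would be fed to the metric alongside the new
# source/translation pair to guide its quality estimate.
context = retrieve_examples([1.0, 0.0], pool, k=2)
```

Here the two semantically closest examples (the "sleeps" sentences) are retrieved, together with their human scores, to contextualize the new judgment.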
📝 Abstract
Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes as input translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves segment-level metric performance (from 0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (from 0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
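All of the reported gains are measured with Kendall's tau-b, a rank-correlation coefficient that corrects for ties in either ranking. As a reference for the statistic itself (not the paper's evaluation code), a minimal pure-Python implementation using the O(n²) pairwise definition looks like:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C/D count
    concordant/discordant pairs and Tx/Ty count pairs tied only in x/only in y.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: excluded from every term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Identical rankings give 1.0, reversed rankings give -1.0, and ties shrink the denominator's correction terms; `scipy.stats.kendalltau` computes the same tau-b variant by default.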