ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

📅 2025-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Automatic text evaluation faces two key bottlenecks: reference-based metrics correlate only weakly with human judgments, and LLM-based evaluators, especially those built on smaller models, still fall short of human alignment. To address this, the authors propose ContrastScore, a contrastive evaluation metric for assessing generated text that mitigates common evaluation biases such as length and likelihood preferences. Notably, ContrastScore instantiated with Qwen 3B or even Qwen 0.5B achieves stronger correlation with human judgments than Qwen 7B, despite using far fewer parameters. Across machine translation and summarization, it consistently outperforms both single-model and ensemble baselines while also improving computational efficiency, enabling higher-quality evaluation at significantly lower resource cost.

📝 Abstract
Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
Problem

Research questions and friction points this paper is trying to address.

Improving correlation between automatic and human text evaluation
Reducing biases in text generation assessment metrics
Enhancing efficiency of language model-based evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive evaluation metric for NLG
Mitigates biases in text evaluation
Efficient with smaller model sizes
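The paper does not spell out its scoring formula here, but one common way to realize a contrastive evaluation metric is to score a candidate text by the difference between per-token log-probabilities from a stronger and a weaker scorer, penalizing tokens the weaker model over-rates (a known source of likelihood and length bias). The sketch below is a minimal toy illustration of that idea, not the paper's actual method: the token probabilities are hard-coded stand-ins for what would normally come from LLM forward passes, and the weighting factor `lam` is a hypothetical hyperparameter.

```python
import math

# Toy stand-in "models": per-token probabilities for a candidate sentence.
# In a real setup these would come from two LLMs (e.g. a larger and a
# smaller Qwen) scoring the generated text conditioned on the source.
strong_probs = {"the": 0.30, "cat": 0.20, "sat": 0.25}
weak_probs = {"the": 0.28, "cat": 0.05, "sat": 0.10}

def contrast_score(tokens, strong, weak, lam=0.5):
    """Average contrastive log-likelihood: log p_strong - lam * log p_weak.

    Tokens that the weaker model already rates highly (often generic,
    high-frequency words) contribute less, which is one way to damp
    length and likelihood biases in the final score.
    """
    total = 0.0
    for t in tokens:
        total += math.log(strong[t]) - lam * math.log(weak[t])
    return total / len(tokens)

score = contrast_score(["the", "cat", "sat"], strong_probs, weak_probs)
```

Averaging over tokens (rather than summing) keeps the score comparable across candidates of different lengths, which is relevant to the length-bias concern the paper raises.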