🤖 AI Summary
Machine translation quality estimation (QE) metrics suffer from systematic length bias: they over-predict errors as translation length grows, even for high-quality, error-free texts, and they prefer shorter translations during candidate re-ranking, undermining fairness and effectiveness in reference-free evaluation and QE-guided reinforcement learning. This paper is the first to systematically expose this dual length bias, which is present in both regression-based and large language model-based discriminative QE approaches, and proposes two mitigation strategies: length normalization during training and pseudo-reference injection at inference time. Experiments across ten language pairs demonstrate that these strategies significantly reduce length bias, yield more rational scores for long translations, and improve decision quality in both QE-based re-ranking and QE-guided reinforcement learning, pointing the way toward length-robust QE metrics.
📝 Abstract
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as reward signals in tasks such as reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors as translation length increases, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decisions in applications such as QE-based reranking and QE-guided reinforcement learning. To mitigate them, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
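To make the intuition behind strategy (a) concrete, here is a minimal sketch of length-normalized scoring. It assumes a hypothetical MQM-style setup in which each flagged error carries a severity weight and the raw score subtracts the summed penalty; the function names and weights are illustrative, not the paper's actual implementation.

```python
def raw_qe_score(severity_weights):
    """Unnormalized score: the total penalty is independent of translation
    length, so longer outputs accumulate more penalty simply because they
    offer more opportunities to flag errors."""
    return 1.0 - sum(severity_weights)

def length_normalized_qe_score(severity_weights, num_tokens):
    """Divide the total penalty by the translation's token count, so a
    single minor error in a long translation costs less per token than
    the same error in a short one."""
    penalty = sum(severity_weights) / max(num_tokens, 1)
    return 1.0 - penalty

# Two candidates, each with one minor error of weight 0.1:
short_cand = length_normalized_qe_score([0.1], num_tokens=5)   # 1 - 0.1/5  = 0.98
long_cand  = length_normalized_qe_score([0.1], num_tokens=50)  # 1 - 0.1/50 = 0.998
```

Without normalization both candidates would receive the identical raw score 0.9, and any tie-break (or any tendency of the metric to flag more errors in longer text) would favor the shorter candidate; normalizing by length removes that systematic advantage during re-ranking.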