🤖 AI Summary
Machine translation quality estimation (QE) metrics suffer from systematic length bias: they over-predict errors as translation length grows, even for high-quality, error-free texts, and they prefer shorter translations during candidate re-ranking, undermining fairness and effectiveness in reference-free evaluation and QE-guided reinforcement learning. This paper is the first to systematically expose this dual length bias, which is present in both regression-based and large language model-based discriminative QE approaches, and proposes two mitigation strategies: length normalization during training and pseudo-reference injection at inference time. Experiments across ten language pairs demonstrate that these strategies significantly reduce length bias, yield more rational scores for long translations, and improve decision quality in both QE-based re-ranking and QE-guided reinforcement learning, pointing the way toward length-robust QE metrics.
📝 Abstract
Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as reward signals in tasks such as reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors as translation length increases, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decisions in applications such as QE-based reranking and QE-guided reinforcement learning. To mitigate them, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
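To make the intuition behind strategy (a) concrete, here is a minimal sketch of length-normalized scoring. It assumes a hypothetical MQM-style setup in which each flagged error carries a severity weight and the raw score subtracts the summed penalty; the function names and weights are illustrative, not the paper's actual implementation.

```python
def raw_qe_score(severity_weights):
    """Unnormalized score: the total penalty is independent of translation
    length, so longer outputs accumulate more penalty simply because they
    offer more opportunities to flag errors."""
    return 1.0 - sum(severity_weights)

def length_normalized_qe_score(severity_weights, num_tokens):
    """Divide the total penalty by the translation's token count, so a
    single minor error in a long translation costs less per token than
    the same error in a short one."""
    penalty = sum(severity_weights) / max(num_tokens, 1)
    return 1.0 - penalty

# Two candidates, each with one minor error of weight 0.1:
short_cand = length_normalized_qe_score([0.1], num_tokens=5)   # 1 - 0.1/5  = 0.98
long_cand  = length_normalized_qe_score([0.1], num_tokens=50)  # 1 - 0.1/50 = 0.998
```

Without normalization both candidates would receive the identical raw score 0.9, and any tie-break (or any tendency of the metric to flag more errors in longer text) would favor the shorter candidate; normalizing by length removes that systematic advantage during re-ranking.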