🤖 AI Summary
This work investigates whether multiple reference translations improve literary machine translation, exploiting the fact that a source sentence often admits several semantically valid target-language renderings. Using the Par3 literary parallel corpus, the authors stratify English reference translations by semantic similarity (low/medium/high) and fine-tune mT5-large and LLaMA-2-7B under a fixed total training-sample budget. With the number of training instances held constant, single-reference training on more source texts only marginally outperforms multi-reference training on half as many source texts; however, restricting references to medium- and high-similarity paraphrases outperforms an unfiltered multi-reference set (+0.3–0.5 BLEU, +0.2–0.9 COMET, +0.25–0.32 chrF++). The findings suggest that semantic consistency among references, rather than quantity alone, is what enables multi-reference gains. Code is publicly available.
📝 Abstract
While a source sentence can be translated in many ways, most machine translation (MT) models are trained with only a single reference. Previous work has shown that using synthetic paraphrases can improve MT. This paper investigates best practices for employing multiple references by analyzing the semantic similarity among different English translations of world literature in the Par3 dataset. We classify the semantic similarity between paraphrases into three groups (low, medium, and high) and fine-tune two LLMs (mT5-large and LLaMA-2-7B) on downstream MT tasks. Across models, with the total number of training instances held constant, single-reference training on more source texts only marginally outperforms multi-reference training on half as many source texts. Moreover, using only paraphrases of medium and high semantic similarity outperforms an unfiltered dataset (+0.3–0.5 BLEU, +0.2–0.9 COMET, +0.25–0.32 chrF++). Our code is publicly available on GitHub.
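The low/medium/high stratification can be sketched as cosine similarity between reference embeddings, bucketed by thresholds. This is a minimal illustration only: the abstract does not specify the embedding model or cutoffs, so the `LOW_MAX`/`MEDIUM_MAX` values and the toy vectors below are assumptions, not the paper's configuration.

```python
import math

# Illustrative thresholds for the low/medium/high strata; the paper's
# actual cutoffs (and its embedding model) are not stated here.
LOW_MAX = 0.6
MEDIUM_MAX = 0.85

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_bucket(emb_a, emb_b):
    """Assign a pair of reference-translation embeddings to a stratum."""
    sim = cosine_similarity(emb_a, emb_b)
    if sim < LOW_MAX:
        return "low"
    if sim < MEDIUM_MAX:
        return "medium"
    return "high"

def filter_references(pairs):
    """Keep only medium/high-similarity pairs, mirroring the filtering
    step that outperformed the unfiltered multi-reference set."""
    return [p for p in pairs
            if similarity_bucket(p[0], p[1]) != "low"]
```

In practice the embeddings would come from a sentence-encoder applied to each English reference of the same source passage; the bucketing and filtering logic is independent of that choice.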