🤖 AI Summary
Web-crawled parallel corpora for biomedical English–Polish neural machine translation suffer from low quality and high redundancy, leading to inefficient model training. Method: This study presents the first systematic evaluation of how three cross-lingual sentence-embedding filtering methods (LASER, MUSE, and LaBSE) affect fine-tuning of mBART50. Contribution/Results: LASER compresses the UFAL Medical Corpus by 70% while significantly improving BLEU (+2.3) on the Khresmoi test set, enhancing translation fluency and naturalness (confirmed by native-speaker preference in human evaluation) and cutting training compute by over 60%. This work addresses the domain-adaptation bottleneck of cross-lingual embedding filtering in low-resource biomedical MT, establishing a reproducible, efficient approach to constructing high-quality, compact bilingual datasets.
📝 Abstract
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality and redundant entries and thus pose significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, namely LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated with the SacreBLEU metric on the Khresmoi dataset; translation quality was additionally assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend LASER, as it consistently outperforms the other methods and produces the most fluent and natural-sounding translations.
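The core idea behind embedding-based corpus filtering can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the function name `filter_parallel_pairs`, the similarity threshold, and the toy vectors are assumptions; in practice the embeddings would come from a cross-lingual encoder such as LASER or LaBSE, and each source/target sentence pair is kept only if its embeddings are sufficiently similar.

```python
import numpy as np

def filter_parallel_pairs(src_embs: np.ndarray,
                          tgt_embs: np.ndarray,
                          threshold: float = 0.8) -> np.ndarray:
    """Return indices of sentence pairs to keep.

    src_embs, tgt_embs: (n, d) arrays of L2-normalised cross-lingual
    sentence embeddings for the aligned source and target sentences.
    A pair is kept when the cosine similarity of its two embeddings
    meets the threshold.
    """
    # Rows are unit vectors, so the dot product of each aligned pair
    # is its cosine similarity.
    sims = np.sum(src_embs * tgt_embs, axis=1)
    return np.flatnonzero(sims >= threshold)

# Toy 3-d "embeddings": pair 0 is well aligned, pair 1 is a
# mistranslation (near-orthogonal vectors), so only pair 0 survives.
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tgt = np.array([[0.96, 0.28, 0.0], [1.0, 0.0, 0.0]])
keep = filter_parallel_pairs(src, tgt, threshold=0.8)
print(keep)  # → [0]
```

Thresholding on pairwise cosine similarity is what lets the corpus shrink sharply (the 70% compression reported above) while discarding mostly misaligned or noisy pairs rather than useful training signal.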