🤖 AI Summary
Web-crawled parallel corpora for biomedical English–Polish neural machine translation suffer from low quality and high redundancy, leading to inefficient model training. Method: This study presents the first systematic evaluation of how three cross-lingual sentence-embedding filtering methods (LASER, MUSE, and LaBSE) affect fine-tuning of mBART50. Contribution/Results: LASER compresses the UFAL Medical Corpus by 70% while significantly improving BLEU (+2.3) on the Khresmoi test set, enhancing translation fluency and naturalness (confirmed by native-speaker preference in human evaluation) and cutting training compute by over 60%. This work addresses the domain-adaptation bottleneck of cross-lingual embedding filtering in low-resource biomedical MT, establishing a reproducible, efficient approach to constructing high-quality, compact bilingual datasets.
📝 Abstract
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality and redundant entries and thus pose significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, namely LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated with the SacreBLEU metric on the Khresmoi dataset; translation quality was additionally assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend LASER, as it consistently outperforms the other methods and produces the most fluent and natural-sounding translations.
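The core idea behind embedding-based corpus filtering can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the function name `filter_parallel_pairs`, the similarity threshold, and the toy vectors are assumptions; in practice the embeddings would come from a cross-lingual encoder such as LASER or LaBSE, and each source/target sentence pair is kept only if its embeddings are sufficiently similar.

```python
import numpy as np

def filter_parallel_pairs(src_embs: np.ndarray,
                          tgt_embs: np.ndarray,
                          threshold: float = 0.8) -> np.ndarray:
    """Return indices of sentence pairs to keep.

    src_embs, tgt_embs: (n, d) arrays of L2-normalised cross-lingual
    sentence embeddings for the aligned source and target sentences.
    A pair is kept when the cosine similarity of its two embeddings
    meets the threshold.
    """
    # Rows are unit vectors, so the dot product of each aligned pair
    # is its cosine similarity.
    sims = np.sum(src_embs * tgt_embs, axis=1)
    return np.flatnonzero(sims >= threshold)

# Toy 3-d "embeddings": pair 0 is well aligned, pair 1 is a
# mistranslation (near-orthogonal vectors), so only pair 0 survives.
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tgt = np.array([[0.96, 0.28, 0.0], [1.0, 0.0, 0.0]])
keep = filter_parallel_pairs(src, tgt, threshold=0.8)
print(keep)  # → [0]
```

Thresholding on pairwise cosine similarity is what lets the corpus shrink sharply (the 70% compression reported above) while discarding mostly misaligned or noisy pairs rather than useful training signal.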