A comparison of data filtering techniques for English-Polish LLM-based machine translation in the biomedical domain

📅 2025-01-27
🤖 AI Summary
Problem: Web-crawled parallel corpora for biomedical English–Polish neural machine translation suffer from low quality and high redundancy, which makes model training inefficient. Method: This study systematically evaluates, for the first time, the impact of three cross-lingual sentence embedding filtering methods (LASER, MUSE, and LaBSE) on fine-tuning mBART50. Contribution/Results: LASER achieves a 70% compression of the UFAL Medical Corpus while significantly improving BLEU (+2.3) on the Khresmoi test set, enhancing translation fluency and naturalness (confirmed by native-speaker preference in human evaluation), and reducing training computational cost by over 60%. This work addresses the domain-adaptation bottleneck of cross-lingual embedding filtering in low-resource biomedical MT and establishes a reproducible, efficient approach to constructing high-quality, compact bilingual datasets.

📝 Abstract
Large Language Models (LLMs) have become state-of-the-art in Machine Translation (MT). They are often trained on massive bilingual parallel corpora scraped from the web, which contain low-quality entries and redundant information, leading to significant computational challenges. Various data filtering methods exist to reduce dataset sizes, but their effectiveness varies widely across language pairs and domains. This paper evaluates the impact of commonly used data filtering techniques, such as LASER, MUSE, and LaBSE, on English-Polish translation within the biomedical domain. By filtering the UFAL Medical Corpus, we created datasets of varying sizes to fine-tune the mBART50 model, which was then evaluated using the SacreBLEU metric on the Khresmoi dataset, with translation quality additionally assessed by bilingual speakers. Our results show that both LASER and MUSE can significantly reduce dataset sizes while maintaining or even enhancing performance. We recommend the use of LASER, as it consistently outperforms the other methods and provides the most fluent and natural-sounding translations.
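
The filtering idea described in the abstract can be illustrated with a minimal sketch: embed each source and target sentence in a shared cross-lingual space, score each pair by cosine similarity, and keep only the best-scoring fraction. Everything named here is an assumption made for illustration. The functions `cosine_similarity` and `filter_parallel_corpus`, the `keep_fraction` parameter, and the hand-made 2-dimensional vectors are hypothetical; in practice the vectors would come from an encoder such as LASER, MUSE, or LaBSE, and the exact scoring and cut-off used in the paper are not stated on this page.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_parallel_corpus(
    pairs: List[Pair],
    src_embeddings: List[Vector],
    tgt_embeddings: List[Vector],
    keep_fraction: float,
) -> List[Pair]:
    """Keep the top `keep_fraction` of sentence pairs by embedding similarity."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not (len(pairs) == len(src_embeddings) == len(tgt_embeddings)):
        raise ValueError("pairs and embeddings must have the same length")
    if not pairs:
        return []

    scores = [
        cosine_similarity(s, t) for s, t in zip(src_embeddings, tgt_embeddings)
    ]
    n_keep = math.ceil(keep_fraction * len(pairs))
    # Rank pair indices from most to least similar, then restore corpus order.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return [pairs[i] for i in kept_indices]


# Toy data: the vectors are placeholders, NOT real LASER/MUSE/LaBSE output.
pairs = [
    ("The patient has a fever.", "Pacjent ma gorączkę."),
    ("Click here to subscribe.", "Dawkowanie leku zależy od wieku."),
    ("Take one tablet daily.", "Przyjmować jedną tabletkę dziennie."),
    ("Store below 25 degrees.", "Lek dostępny na receptę."),
]
src_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [1.0, 1.0]]
tgt_embeddings = [[0.9, 0.1], [1.0, 0.0], [0.5, 0.9], [1.0, 0.0]]

kept = filter_parallel_corpus(pairs, src_embeddings, tgt_embeddings, 0.5)
for source, target in kept:
    print(source, "|||", target)
```

Ranking by score and keeping a fixed fraction makes it straightforward to produce datasets of several sizes from one scored corpus, which matches the "varying dataset sizes" setup in the abstract; a fixed similarity threshold would be an equally plausible alternative.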

Problem

Research questions and friction points this paper is trying to address.

Data Cleaning Methods
Biomedical Translation
Machine Translation Quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

LASER
MUSE
Biomedical Translation

Authors

Jorge del Pozo Lérida
IT University of Copenhagen
Kamil Kojs
IT University of Copenhagen
János Máté
IT University of Copenhagen
Mikołaj Antoni Barański
IT University of Copenhagen
Christian Hardmeier
Associate Professor, IT University of Copenhagen
Natural Language Processing, Machine Translation, Discourse, Bias/Fairness in NLP, Translation Studies