🤖 AI Summary
This study addresses the language barriers that impede the global dissemination of scientific research. Generic machine translation systems struggle with scholarly texts because they handle domain-specific terminology and complex syntactic structures poorly. To bridge this gap, the authors present the first systematic construction of Spanish–English, French–English, and Portuguese–English parallel and monolingual corpora: for each language pair, a large general scientific corpus plus four smaller corpora covering cancer research, energy research, neuroscience, and transportation research. Leveraging these resources, they perform domain-adaptive fine-tuning of general-purpose neural machine translation models. Experimental results demonstrate that the fine-tuned systems significantly outperform generic baselines in translation quality for scientific content, confirming the value of multilingual, multidisciplinary specialized corpora for accurate and fluent translation of research literature.
📝 Abstract
The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of Cancer Research, Energy Research, Neuroscience, and Transportation Research. To evaluate the quality of these corpora, we use them to fine-tune general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process and the fine-tuning strategies employed, and we conclude with the evaluation results.