Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole

📅 2025-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses domain adaptation in machine translation for the low-resource language Guinea-Bissau Creole (Kiriol), specifically the challenge of generalizing from religious-domain training data to broader, general-purpose text. We propose a lightweight domain adaptation method within the Transformer architecture: mixing in and fine-tuning on a small, highly domain-relevant target set of only 300 sentences. We also introduce the first open-source Kiriol parallel corpus, comprising around 40,000 sentence pairs. We conduct the first systematic evaluation of translation performance across the English↔Kiriol and Portuguese↔Kiriol directions, finding that Portuguese→Kiriol models perform better on average than the other language pairs, a result we relate to the morphological complexity of the languages involved and to lexical overlap between the creole and its lexifier. Experimental results show that injecting this minimal, domain-matched data yields a +4.2 BLEU improvement, demonstrating the value of targeted adaptation even with very little data. Our study provides both a reproducible methodology and foundational data resources for practical MT deployment in severely under-resourced languages.
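The data-mixing idea in the summary above can be illustrated with a minimal sketch. It assumes sentence pairs are held as in-memory tuples; the function name `mix_training_data` and the `upsample` parameter are hypothetical and are not taken from the paper, which does not state whether the in-domain sentences are repeated.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def mix_training_data(
    religious: List[Pair],
    in_domain: List[Pair],
    upsample: int = 1,
    seed: int = 0,
) -> List[Pair]:
    """Combine a large religious corpus with a small in-domain set.

    `upsample` repeats the in-domain pairs so they are seen more often
    per epoch. It is a hypothetical knob for illustration only.
    """
    if upsample < 1:
        raise ValueError("upsample must be >= 1")
    # Concatenate both sources, repeating the small set if requested.
    mixed = list(religious) + list(in_domain) * upsample
    # Shuffle with a fixed seed so the training order is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed
```

With a 40,000-pair religious corpus and 300 in-domain pairs, the in-domain share of the mixed set is under 1% when `upsample=1`, which is why a repetition factor is a common thing to experiment with in setups like this.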

📝 Abstract

We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand sentences parallel to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah's Witnesses), but also a small amount of general-domain data (from a dictionary). This mirrors the typical resource availability of many low-resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves translation performance, highlighting the importance of, and need for, data collection for low-resource languages, even on a small scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.
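One simple way to quantify the "degree of lexical overlap" mentioned in the abstract is a type-level Jaccard similarity between the vocabularies of two corpora. The sketch below is illustrative only: the abstract does not say which overlap measure the authors use, and the function names and the regex tokenizer are assumptions.

```python
import re
from typing import Iterable, Set


def vocabulary(sentences: Iterable[str]) -> Set[str]:
    """Collect the set of lowercased word types in a corpus."""
    words: Set[str] = set()
    for sentence in sentences:
        # \w+ matches runs of letters, digits, and underscores (Unicode-aware).
        words.update(re.findall(r"\w+", sentence.lower()))
    return words


def lexical_overlap(corpus_a: Iterable[str], corpus_b: Iterable[str]) -> float:
    """Jaccard similarity between the word-type sets of two corpora."""
    vocab_a, vocab_b = vocabulary(corpus_a), vocabulary(corpus_b)
    union = vocab_a | vocab_b
    # Define the overlap of two empty corpora as 0.0 to avoid dividing by zero.
    return len(vocab_a & vocab_b) / len(union) if union else 0.0
```

Applied to the Kiriol side and the Portuguese side of a parallel corpus, and then to the Kiriol and English sides, a measure like this gives a rough basis for comparing how much surface vocabulary the creole shares with each language. Exact string matching misses cognates that differ in spelling, so it should be read as a lower bound.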

Problem

Research questions and friction points this paper is trying to address.

Addressing lack of machine translation data for Guinea-Bissau Creole

Improving domain transfer from religious to general text translation

Investigating performance differences in Portuguese-to-Kiriol translation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

New dataset for Kiriol machine translation

Adding as few as 300 target-domain sentences improves domain transfer in transformer models

Portuguese-to-Kiriol models perform best on average

🔎 Similar Papers

Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing