🤖 AI Summary
Public parallel corpora for low-resource Mande languages such as Kpelle are severely lacking, hindering NLP development. Method: The authors construct the first publicly available English–Kpelle bilingual corpus (over 2,000 sentence pairs) covering everyday communication, religious texts, and educational materials. Translation models are built by fine-tuning Meta's NLLB-200 on two versions of the dataset, with and without data augmentation. Contribution/Results: The Kpelle→English system reaches a BLEU score of up to 30, with augmentation providing a clear benefit; results are in line with NLLB-200 benchmarks on other African languages. The corpus also supports downstream tasks such as speech recognition and language modelling. The paper closes with a roadmap for dataset expansion that emphasizes orthographic consistency, community-driven validation, and interdisciplinary collaboration for Kpelle and other low-resource Mande languages.
📝 Abstract
In this paper, we introduce the first publicly available English-Kpelle dataset for machine translation, comprising over 2,000 sentence pairs drawn from everyday communication, religious texts, and educational materials. By fine-tuning Meta's No Language Left Behind (NLLB) model on two versions of the dataset, we achieved BLEU scores of up to 30 in the Kpelle-to-English direction, demonstrating the benefits of data augmentation. Our findings align with NLLB-200 benchmarks on other African languages, underscoring Kpelle's potential for competitive performance despite its low-resource status. Beyond machine translation, this dataset enables broader NLP tasks, including speech recognition and language modelling. We conclude with a roadmap for future dataset expansion, emphasizing orthographic consistency, community-driven validation, and interdisciplinary collaboration to advance inclusive language technology development for Kpelle and other low-resourced Mande languages.
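The quality figure reported above is a BLEU score. As a rough illustration of what that metric measures (not the authors' evaluation pipeline, which would typically use a standard tool such as sacreBLEU), a minimal single-pair BLEU with uniform n-gram weights and the standard brevity penalty can be sketched as:

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Illustrative single-pair BLEU (0-100 scale).

    Geometric mean of clipped 1..max_n n-gram precisions, multiplied by
    the brevity penalty. Real evaluations use corpus-level statistics
    and standardized tokenization (e.g. sacreBLEU); this is a sketch.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        ref_ngrams = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * geo_mean
```

A perfect match scores 100, while a hypothesis sharing no 4-gram-or-shorter overlap with the reference scores 0; scores around 30, as reported here, indicate substantial n-gram overlap and are competitive for low-resource translation.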