Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages

📅 2025-11-09
🤖 AI Summary
Nigeria is home to more than 500 languages, yet NLP research has concentrated on four of them (Hausa, Igbo, Nigerian-Pidgin, and Yoruba), which together account for less than 1% of the languages spoken in the country. Four Coastal Nigerian languages of Akwa Ibom State (Anaang, Efik, Ibibio, and Oro) are absent from Google Translate and from major benchmarks such as Flores-200 and SIB-200, in part because little textual data is available for them. Method: The paper introduces ibom, a dataset of parallel corpora and topic classification data for these four languages. It extends the Flores-200 benchmark to them and aligns the translated texts with topic labels based on the SIB-200 classification dataset. All data are manually translated and annotated. Contribution/Results: In zero-shot and few-shot evaluations, current LLMs perform poorly on machine translation for these languages, while topic classification accuracy improves steadily as more in-context examples are provided.

📝 Abstract
Nigeria is the most populous country in Africa with a population of more than 200 million people. More than 500 languages are spoken in Nigeria and it is one of the most linguistically diverse countries in the world. Despite this, natural language processing (NLP) research has mostly focused on the following four languages: Hausa, Igbo, Nigerian-Pidgin, and Yoruba (i.e., <1% of the languages spoken in Nigeria). This is in part due to the unavailability of textual data in these languages to train and apply NLP algorithms. In this work, we introduce ibom -- a dataset for machine translation and topic classification in four Coastal Nigerian languages from the Akwa Ibom State region: Anaang, Efik, Ibibio, and Oro. These languages are not represented in Google Translate or in major benchmarks such as Flores-200 or SIB-200. We focus on extending the Flores-200 benchmark to these languages, and further align the translated texts with topic labels based on the SIB-200 classification dataset. Our evaluation shows that current LLMs perform poorly on machine translation for these languages in both zero- and few-shot settings. However, we find that topic classification improves steadily as more few-shot examples are added.
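The zero- and few-shot topic classification setting described in the abstract can be illustrated with a small prompt builder. This is a minimal sketch and not the authors' evaluation code: the function name `build_prompt`, the prompt wording, and the placeholder sentences are hypothetical, and the label list assumes the seven SIB-200 topic categories, which should be checked against the released dataset.

```python
from typing import List, Tuple

# Assumption: the seven topic categories used by SIB-200.
SIB200_LABELS = [
    "science/technology", "travel", "politics", "sports",
    "health", "entertainment", "geography",
]


def build_prompt(text: str, shots: List[Tuple[str, str]], language: str) -> str:
    """Build a k-shot classification prompt; an empty `shots` list gives zero-shot."""
    lines = [
        f"Classify the {language} sentence into one of these topics: "
        + ", ".join(SIB200_LABELS) + ".",
        "",
    ]
    # Each in-context example is a (sentence, label) pair in the target language.
    for example_text, label in shots:
        if label not in SIB200_LABELS:
            raise ValueError(f"unknown label: {label}")
        lines += [f"Sentence: {example_text}", f"Topic: {label}", ""]
    # The query sentence comes last, with the label left for the model to fill in.
    lines += [f"Sentence: {text}", "Topic:"]
    return "\n".join(lines)


# Placeholder strings stand in for real Efik sentences.
zero_shot = build_prompt("<Efik test sentence>", [], "Efik")
two_shot = build_prompt(
    "<Efik test sentence>",
    [("<Efik example 1>", "health"), ("<Efik example 2>", "sports")],
    "Efik",
)
```

Raising the number of pairs in `shots` corresponds to the "more shots" condition in the abstract; the model's completion after the final `Topic:` is then compared with the gold label.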
Problem

Research questions and friction points this paper is trying to address.

Developing NLP resources for underrepresented Nigerian minority languages
Creating machine translation datasets for four Coastal Nigerian languages
Addressing data scarcity in benchmarks like Flores-200 and SIB-200
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created Ibom dataset for four Nigerian minority languages
Extended Flores-200 benchmark to include these languages
Aligned translated texts with topic classification labels
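Aligning translated texts with topic labels can be sketched as a join on a shared sentence identifier. The sketch assumes that each translated sentence keeps the ID of its source sentence and that topic labels are keyed by the same IDs; the names `LabeledExample` and `align_labels` and the placeholder data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple


class LabeledExample(NamedTuple):
    sentence_id: int
    text: str   # translated sentence in the target language
    label: str  # topic label carried over from the source sentence


def align_labels(
    translations: Dict[int, str], topic_labels: Dict[int, str]
) -> List[LabeledExample]:
    """Pair each translated sentence with the topic label of the same sentence ID."""
    # Only IDs present on both sides can be aligned; the rest are skipped.
    shared_ids = sorted(translations.keys() & topic_labels.keys())
    return [
        LabeledExample(i, translations[i], topic_labels[i]) for i in shared_ids
    ]


# Placeholder data: IDs 2 and 3 appear in both mappings.
translations = {1: "<Ibibio sentence 1>", 2: "<Ibibio sentence 2>", 3: "<Ibibio sentence 3>"}
topic_labels = {2: "health", 3: "travel", 4: "sports"}
aligned = align_labels(translations, topic_labels)
```

Because the join is on sentence IDs rather than on text, the same label mapping can be reused for each of the four languages.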
Authors

Oluwadara Kalejaiye
Electrical Engineering and Computer Science, Howard University

Luel Hagos Beyene
AIMS Research and Innovation Centre, NM-AIST

David Ifeoluwa Adelani
McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP

Mmekut-Mfon Gabriel Edet
Korapay

A. Akpan
National Institute for Nigerian Languages

E. Urua
University of Uyo

Anietie U Andy
Electrical Engineering and Computer Science, Howard University