YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of existing Yorùbá named entity recognition (NER) research, which suffers from narrow domain coverage and insufficient data. To overcome this, the authors introduce YoNER, the first multi-domain NER dataset for Yorùbá, spanning five domains (Bible, blogs, movies, radio broadcast transcripts, and Wikipedia) with approximately 5,000 sentences and 100,000 tokens. The dataset was manually annotated by three native speakers following CoNLL guidelines, and the paper also presents OyoBERT, a dedicated Yorùbá language model. Through systematic evaluation of Transformer-based architectures under in-domain, cross-domain transfer, few-shot, and cross-lingual settings, the experiments show that African-centric models outperform general multilingual baselines, that transfer is most effective between linguistically similar domains, and that OyoBERT achieves state-of-the-art performance on in-domain NER, significantly advancing NLP for low-resource African languages.
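For context, CoNLL-style annotation places one token per line next to its BIO tag, with a blank line separating sentences. The snippet below is an illustrative example constructed for this page, not a sentence drawn from YoNER ("Adé lọ sí Èkó", roughly "Adé went to Lagos"):

```
Adé   B-PER
lọ    O
sí    O
Èkó   B-LOC
.     O
```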
📝 Abstract
Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multi-domain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains: Bible, blogs, movies, radio broadcast, and Wikipedia. It is annotated with three entity types, Person (PER), Organization (ORG), and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models in cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and of cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for the blog and movie domains. We also observe that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
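The paper's released training code is not reproduced on this page, but the benchmarking it describes follows the standard token-classification recipe for transformer encoders. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the file name `yoner_train.conll` is a hypothetical placeholder for CoNLL-style two-column data, and `Davlan/afro-xlmr-base` stands in as one example of an African-centric encoder (the paper's exact model list and hyperparameters may differ).

```python
# Minimal sketch: fine-tuning a transformer encoder for Yorùbá NER as
# token classification over CoNLL-style BIO data (PER/ORG/LOC).
# "yoner_train.conll" is a hypothetical local file; "Davlan/afro-xlmr-base"
# is one example African-centric encoder, not necessarily the paper's choice.
import torch
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

def read_conll(path):
    """Parse a two-column CoNLL file into (tokens, tag_ids) sentences."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                       # blank line ends a sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()
            tokens.append(token)
            tags.append(LABEL2ID[tag])
    if tokens:
        sentences.append((tokens, tags))
    return sentences

class NERDataset(torch.utils.data.Dataset):
    """Tokenizes pre-split words and aligns BIO tags to subword pieces."""
    def __init__(self, sentences, tokenizer):
        self.examples = []
        for tokens, tags in sentences:
            enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
            aligned, previous = [], None
            for word_id in enc.word_ids():
                if word_id is None:
                    aligned.append(-100)           # special token: ignored by loss
                elif word_id != previous:
                    aligned.append(tags[word_id])  # first subword carries the tag
                else:
                    aligned.append(-100)           # later subwords: ignored
                previous = word_id
            enc["labels"] = aligned
            self.examples.append(enc)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

model_name = "Davlan/afro-xlmr-base"               # assumption: example encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)), label2id=LABEL2ID)

train_data = NERDataset(read_conll("yoner_train.conll"), tokenizer)
args = TrainingArguments(output_dir="yoner-ner", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=5)
Trainer(model=model, args=args, train_dataset=train_data,
        data_collator=DataCollatorForTokenClassification(tokenizer)).train()
```

The -100 labels follow the Hugging Face convention for positions excluded from the loss, so each word's BIO tag is predicted from its first subword piece.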
Problem

Research questions and friction points this paper is trying to address.

Named Entity Recognition
Yorùbá
low-resource languages
multi-domain dataset
NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

Yorùbá NER
multi-domain dataset
OyoBERT
cross-domain evaluation
low-resource NLP
👥 Authors
Peace Busola Falola (University of Ibadan, Nigeria)
Jesujoba O. Alabi (Saarland University, Germany)
Solomon O. Akinola (University of Ibadan, Nigeria)
Folashade T. Ogunajo (Atiba University, Nigeria)
Emmanuel Oluwadunsin Alabi (University of Ibadan, Nigeria)
David Ifeoluwa Adelani (McGill University and Mila - Quebec AI Institute and Canada CIFAR AI Chair)
Natural language processing, Multilinguality, Multilingual NLP, AfricaNLP, Low-resource NLP