HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language

📅 2025-09-17
🤖 AI Summary
Sentiment analysis for low-resource African languages, particularly Hausa, is hindered by a severe scarcity of labeled data. Method: This study constructs the first benchmark dataset of 5,000 Hausa and code-switched Hausa–English online movie reviews, manually annotated with high inter-annotator agreement (Fleiss' Kappa = 0.85). It systematically evaluates traditional machine learning models (logistic regression, decision trees, KNN) against fine-tuned Transformer-based models (BERT, RoBERTa), incorporating domain-informed feature engineering. Contribution/Results: A feature-engineered decision tree achieves state-of-the-art performance (89.72% accuracy, 89.60% F1), significantly outperforming all deep learning baselines. This challenges the prevailing assumption that Transformer models inherently dominate low-resource NLP tasks and advocates a new paradigm: lightweight, interpretable models augmented with linguistically grounded, domain-adapted features. The dataset and best-performing model establish the strongest publicly available baseline for Hausa sentiment analysis to date.

📝 Abstract
The development of Natural Language Processing (NLP) tools for low-resource languages is critically hindered by the scarcity of annotated datasets. This paper addresses this fundamental challenge by introducing HausaMovieReview, a novel benchmark dataset comprising 5,000 YouTube comments in Hausa and code-switched Hausa–English. The dataset was meticulously annotated by three independent annotators, demonstrating robust inter-annotator agreement with a Fleiss' Kappa score of 0.85. We used this dataset to conduct a comparative analysis of classical models (Logistic Regression, Decision Tree, K-Nearest Neighbors) and fine-tuned transformer models (BERT and RoBERTa). Our results reveal a key finding: the Decision Tree classifier, with an accuracy of 89.72% and an F1-score of 89.60%, significantly outperformed the deep learning models. Our findings also provide a robust baseline, demonstrating that effective feature engineering can enable classical models to achieve state-of-the-art performance in low-resource contexts, thereby laying a solid foundation for future research.
Keywords: Hausa, Kannywood, Low-Resource Languages, NLP, Sentiment Analysis
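For context on the agreement statistic reported in the abstract, here is a minimal pure-Python sketch of Fleiss' Kappa; the function name and toy counts are illustrative, not taken from the paper's annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for ratings[i][j] = number of annotators who
    assigned item i to category j (same annotator count per item)."""
    N = len(ratings)        # number of items
    n = sum(ratings[0])     # annotators per item
    k = len(ratings[0])     # number of categories
    # Observed per-item agreement, averaged over items
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Chance agreement from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators, three labels (e.g. positive/neutral/negative); toy counts
print(fleiss_kappa([[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]))
```

A score of 0.85, as reported for HausaMovieReview, falls in the range conventionally read as near-perfect agreement.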
Problem

Research questions and friction points this paper is trying to address.

Addressing annotated dataset scarcity for low-resource African languages
Introducing a benchmark Hausa sentiment analysis dataset from YouTube comments
Comparing classical and transformer models' performance on this dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created the HausaMovieReview dataset of 5,000 annotated comments
Compared classical models against fine-tuned transformer models
Feature-engineered Decision Tree outperformed the deep learning models with 89.72% accuracy
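The classical pipeline summarized above can be sketched with scikit-learn. Everything here is an assumption for illustration: the example comments and labels are invented, and character n-gram TF-IDF stands in for the paper's actual domain-informed feature engineering, which is not specified in this summary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy Hausa-style movie comments (hypothetical, not from the dataset)
texts = [
    "fim din nan yayi kyau sosai",
    "ban ji dadin wannan fim ba",
    "labari mai dadi, naji dadinsa",
    "wannan fim bata lokaci ne kawai",
]
labels = ["positive", "negative", "positive", "negative"]

# Character n-grams are one common choice for morphologically rich,
# code-switched text; the paper's own feature set may differ.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    DecisionTreeClassifier(random_state=0),
)
model.fit(texts, labels)
print(model.predict(["fim din yayi kyau"]))
```

A lightweight pipeline like this trains in seconds on CPU, which is part of the appeal of classical models in low-resource settings compared with fine-tuning BERT or RoBERTa.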
Asiya Ibrahim Zanga
Department of Computer Science, Federal University Dutsin-Ma, Katsina, Nigeria
Salisu Mamman Abdulrahman
Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Kano, Nigeria
Abubakar Ado
Department of Computer Science, Northwest University, Kano, Nigeria
Abdulkadir Abubakar Bichi
Department of Computer Science, Northwest University, Kano, Nigeria
Lukman Aliyu Jibril
Abdulmajid Babangida Umar
Department of Computer Science, Northwest University, Kano, Nigeria
Alhassan Adamu
Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Kano, Nigeria
Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing · Sentiment Analysis · AfricaNLP · Low-resource NLP · Multilinguality
Bashir Salisu Abubakar
Kano University of Science and Technology, Wudil
Natural Language Processing · Text Summarization