HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language

📅 2025-09-17
🤖 AI Summary
Sentiment analysis for low-resource African languages, particularly Hausa, is hindered by a severe scarcity of labeled data. Method: This study constructs the first benchmark dataset of 5,000 Hausa and code-switched Hausa–English online movie reviews, manually annotated with high inter-annotator agreement (Fleiss' Kappa = 0.85). It systematically evaluates traditional machine learning models (logistic regression, decision trees, KNN) against fine-tuned Transformer-based models (BERT, RoBERTa), incorporating domain-informed feature engineering. Contribution/Results: A feature-engineered decision tree achieves state-of-the-art performance (89.72% accuracy, 89.60% F1), significantly outperforming all deep learning baselines. This challenges the prevailing assumption that Transformer models inherently dominate low-resource NLP tasks and advocates a new paradigm: lightweight, interpretable models augmented with linguistically grounded, domain-adapted features. The dataset and best-performing model establish the strongest publicly available baseline for Hausa sentiment analysis to date.

📝 Abstract
The development of Natural Language Processing (NLP) tools for low-resource languages is critically hindered by the scarcity of annotated datasets. This paper addresses this fundamental challenge by introducing HausaMovieReview, a novel benchmark dataset comprising 5,000 YouTube comments in Hausa and code-switched Hausa–English. The dataset was meticulously annotated by three independent annotators, demonstrating robust inter-annotator agreement with a Fleiss' Kappa score of 0.85. We used this dataset to conduct a comparative analysis of classical models (Logistic Regression, Decision Tree, K-Nearest Neighbors) and fine-tuned transformer models (BERT and RoBERTa). Our results reveal a key finding: the Decision Tree classifier, with an accuracy of 89.72% and an F1-score of 89.60%, significantly outperformed the deep learning models. Our findings also provide a robust baseline, demonstrating that effective feature engineering can enable classical models to achieve state-of-the-art performance in low-resource contexts, thereby laying a solid foundation for future research.
Keywords: Hausa, Kannywood, Low-Resource Languages, NLP, Sentiment Analysis
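For context on the agreement statistic reported in the abstract, here is a minimal pure-Python sketch of Fleiss' Kappa; the function name and toy counts are illustrative, not taken from the paper's annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for ratings[i][j] = number of annotators who
    assigned item i to category j (same annotator count per item)."""
    N = len(ratings)        # number of items
    n = sum(ratings[0])     # annotators per item
    k = len(ratings[0])     # number of categories
    # Observed per-item agreement, averaged over items
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Chance agreement from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators, three labels (e.g. positive/neutral/negative); toy counts
print(fleiss_kappa([[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]))
```

A score of 0.85, as reported for HausaMovieReview, falls in the range conventionally read as near-perfect agreement.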
Problem

Research questions and friction points this paper is trying to address.

Addressing annotated dataset scarcity for low-resource African languages
Introducing a benchmark Hausa sentiment analysis dataset from YouTube comments
Comparing classical and transformer models' performance on this dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created the HausaMovieReview dataset of 5,000 annotated comments
Compared classical models against fine-tuned transformer models
Feature-engineered Decision Tree outperformed the deep learning models with 89.72% accuracy
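The classical pipeline summarized above can be sketched with scikit-learn. Everything here is an assumption for illustration: the example comments and labels are invented, and character n-gram TF-IDF stands in for the paper's actual domain-informed feature engineering, which is not specified in this summary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy Hausa-style movie comments (hypothetical, not from the dataset)
texts = [
    "fim din nan yayi kyau sosai",
    "ban ji dadin wannan fim ba",
    "labari mai dadi, naji dadinsa",
    "wannan fim bata lokaci ne kawai",
]
labels = ["positive", "negative", "positive", "negative"]

# Character n-grams are one common choice for morphologically rich,
# code-switched text; the paper's own feature set may differ.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    DecisionTreeClassifier(random_state=0),
)
model.fit(texts, labels)
print(model.predict(["fim din yayi kyau"]))
```

A lightweight pipeline like this trains in seconds on CPU, which is part of the appeal of classical models in low-resource settings compared with fine-tuning BERT or RoBERTa.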
Asiya Ibrahim Zanga
Department of Computer Science, Federal University Dutsin-Ma, Katsina, Nigeria
Salisu Mamman Abdulrahman
Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Kano, Nigeria
Abubakar Ado
Department of Computer Science, Northwest University, Kano, Nigeria
Abdulkadir Abubakar Bichi
Department of Computer Science, Northwest University, Kano, Nigeria
Lukman Aliyu Jibril
Abdulmajid Babangida Umar
Department of Computer Science, Northwest University, Kano, Nigeria
Alhassan Adamu
Department of Computer Science, Aliko Dangote University of Science and Technology, Wudil, Kano, Nigeria
Shamsuddeen Hassan Muhammad
Bayero University, Kano, & Google DeepMind Academic Fellow at Imperial College London
Natural Language Processing · Sentiment Analysis · AfricaNLP · Low-resource NLP · Multilinguality
Bashir Salisu Abubakar
Kano University of Science and Technology, Wudil
Natural Language Processing · Text Summarization