Who Wrote This? Identifying Machine vs Human-Generated Text in Hausa

📅 2025-03-17
🤖 AI Summary
To address the misinformation and plagiarism risks that arise when AI-generated text is abused in low-resource languages such as Hausa, this paper introduces the first large-scale detector for distinguishing human-written from machine-generated text in Hausa. Methodologically: (1) the authors construct a Hausa dataset of authentic articles scraped from seven major Hausa-language media outlets, each paired with a machine-generated counterpart that Gemini-2.0 Flash produced from the same headline; (2) they fine-tune and evaluate four Africa-centric pre-trained models: AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L. AfroXLMR performs best, reaching 99.23% accuracy and a 99.21% F1 score, which indicates that Africa-centric pre-trained models are effective for AI-text detection in low-resource settings. The dataset is publicly released to support further research on AI content governance in low-resource languages.
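The dataset construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record fields, the `HUMAN`/`MACHINE` label encoding, the `generate_fn` callback (a stand-in for a call to a text generator such as Gemini-2.0 Flash), and the headline-level train/test split are all assumptions introduced here.

```python
import random

# Hypothetical label encoding; the paper's actual encoding is not stated.
HUMAN, MACHINE = 0, 1


def build_pairs(human_articles, generate_fn):
    """Pair each human article with a machine-generated article for the same headline.

    human_articles: list of dicts with 'headline' and 'text' keys (assumed schema).
    generate_fn: callable mapping a headline to generated text (placeholder
                 for a real model call).
    """
    records = []
    for article in human_articles:
        headline = article["headline"]
        records.append({"headline": headline, "text": article["text"], "label": HUMAN})
        records.append({"headline": headline, "text": generate_fn(headline), "label": MACHINE})
    return records


def split_by_headline(records, test_fraction=0.2, seed=0):
    """Split so that both texts for a headline land on the same side.

    Keeping pairs together avoids the same headline appearing in both
    training and test data (a design choice assumed here).
    """
    headlines = sorted({r["headline"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(headlines)
    n_test = max(1, int(len(headlines) * test_fraction))
    test_headlines = set(headlines[:n_test])
    train = [r for r in records if r["headline"] not in test_headlines]
    test = [r for r in records if r["headline"] in test_headlines]
    return train, test


if __name__ == "__main__":
    # Toy stand-ins for scraped articles and for the generator.
    articles = [{"headline": f"Labari {i}", "text": f"Rubutun mutum {i}"} for i in range(5)]
    data = build_pairs(articles, lambda h: f"Rubutun na'ura game da: {h}")
    train, test = split_by_headline(data)
    print(len(data), len(train), len(test))
```

Because every headline contributes exactly one human and one machine text, the resulting dataset is balanced by construction, which makes accuracy a reasonable headline metric.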

📝 Abstract
The advancement of large language models (LLMs) has made them proficient in various tasks, including content generation. However, their unregulated use can enable malicious activities such as plagiarism and the generation and spread of fake news, especially in low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages such as English and French. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scraped seven Hausa-language media outlets for the human-generated text and used the Gemini-2.0 Flash model to automatically generate corresponding Hausa-language articles from the headlines of the human-written articles. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance, with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.
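For readers unfamiliar with the two reported metrics, the sketch below computes accuracy and F1 for a binary human-vs-machine classifier in plain Python. The labels and predictions are made-up toy values, and treating "machine-generated" as the positive class (label 1) is an assumption; the abstract does not state which F1 variant (binary, macro, or weighted) was used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: 1 = machine-generated, 0 = human-written (assumed encoding).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # 6 of 8 correct -> 0.75
    print(f"F1       = {f1_score(y_true, y_pred):.2f}")  # precision = recall = 0.75 -> 0.75
```

On a balanced dataset, accuracy and F1 tend to be close, which is consistent with the near-identical 99.23% and 99.21% figures reported.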
Problem

Research questions and friction points this paper is trying to address.

Detecting machine-generated text in the Hausa language.
Addressing lack of detectors for low-resource languages.
Evaluating Afri-centric models for Hausa text detection.
Innovation

Methods, ideas, or system contributions that make the work stand out.

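Developed the first large-scale detector of machine-generated Hausa text.
Fine-tuned four Afri-centric language models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L).
Achieved 99.23% accuracy and a 99.21% F1 score with AfroXLMR.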