Automatic Correction of Writing Anomalies in Hausa Texts

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Hausa text frequently exhibits orthographic anomalies—such as character substitution and spurious or missing whitespace—severely hindering low-resource NLP tasks. To address this, we propose the first end-to-end spelling correction method tailored for Hausa. Our approach comprises: (1) constructing the first large-scale noise-clean parallel corpus (450K sentence pairs); and (2) systematically adapting and fine-tuning multilingual and Africa-specific models (AfriTEVA, M2M100, mBART, Opus-MT), enhanced with SentencePiece tokenization and controllable synthetic noise injection. Experiments demonstrate significant improvements across standard metrics: higher F1, BLEU, and METEOR scores, alongside reduced Character Error Rate (CER) and Word Error Rate (WER). All datasets and trained models are publicly released, establishing foundational resources for Hausa NLP and offering a transferable paradigm for spelling correction in other low-resource languages.
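The controllable noise injection described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the substitution map (hooked Hausa letters typed as their ASCII look-alikes) and the probability parameters are assumptions chosen to match the anomaly types the summary names (character substitution, spurious or missing whitespace).

```python
import random

# Hypothetical substitution map: Hausa "hooked" letters are often
# typed as their plain ASCII look-alikes.
SUBSTITUTIONS = {"ɓ": "b", "ɗ": "d", "ƙ": "k", "ƴ": "y"}

def inject_noise(sentence, sub_prob=0.3, space_prob=0.1, seed=None):
    """Return a noisy copy of `sentence` with controllable error rates."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch in SUBSTITUTIONS and rng.random() < sub_prob:
            out.append(SUBSTITUTIONS[ch])  # character substitution
        elif ch == " " and rng.random() < space_prob:
            continue                       # drop whitespace
        else:
            out.append(ch)
            if rng.random() < space_prob / 10:
                out.append(" ")            # spurious whitespace
    return "".join(out)
```

Running the corruptor over a clean corpus yields the noisy side of a noisy-clean parallel pair; varying `sub_prob` and `space_prob` gives the "controllable" aspect.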

📝 Abstract
Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by fine-tuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, calibrated to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.
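The CER and WER metrics reported above are both normalized edit distances: CER over characters, WER over whitespace-delimited tokens. A minimal sketch (generic Levenshtein distance, not the paper's evaluation script):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance; works on strings
    # (character level) or token lists (word level) alike.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Character Error Rate: edits per reference character."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

def wer(hypothesis, reference):
    """Word Error Rate: edits per reference token."""
    ref_tokens = reference.split()
    return levenshtein(hypothesis.split(), ref_tokens) / max(len(ref_tokens), 1)
```

Lower is better for both; a correction model improves over the noisy baseline when CER/WER of its output against the clean reference drops.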
Problem

Research questions and friction points this paper is trying to address.

Correcting writing anomalies in Hausa texts
Improving NLP for Hausa with transformer models
Creating a dataset for low-resource language processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned transformer models for anomaly correction
Synthetic noise generation for parallel dataset creation
Adapted multilingual models with SentencePiece tokenization