AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

📅 2025-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses fine-grained emotion classification for low-resource African languages in social media text. We propose a two-stage continual pretraining paradigm, domain-adaptive pretraining (DAPT) followed by task-adaptive pretraining (TAPT), built on XLM-R to adapt both domain and task knowledge. Our key contributions are: (1) the construction of AfriSocial, the first high-quality pretraining corpus specifically curated for African-language social media text; (2) a systematic empirical validation of DAPT and TAPT for fine-grained emotion classification across 16 African languages; and (3) consistent macro-F1 improvements over strong baselines, ranging from 1% to 28.27% with DAPT, with further gains of 0.55% to 15.11% when TAPT uses unlabeled sentiment data. All datasets, models, and code are publicly released to support downstream applications such as sentiment analysis and hate speech detection.

📝 Abstract
Pretrained Language Models (PLMs) built from various sources are the foundation of today's NLP progress. Language representations learned by such models achieve strong performance across many tasks with datasets of varying sizes drawn from various sources. We present a thorough analysis of domain- and task-adaptive continual pretraining approaches for low-resource African languages and show promising results on the evaluated tasks. We create AfriSocial, a corpus designed for domain-adaptive pretraining that passes through quality pre-processing steps. Continually pretraining PLMs on AfriSocial as domain-adaptive pretraining (DAPT) data consistently improves performance on a fine-grained emotion classification task across 16 targeted languages, with macro F1 gains of 1% to 28.27%. Likewise, with the task-adaptive pretraining (TAPT) approach, further pretraining on small amounts of unlabeled but task-similar data shows promising results. For example, unlabeled sentiment data (source) improves the base model on the fine-grained emotion classification task (target) by an F1 score ranging from 0.55% to 15.11%. Combining the two methods, DAPT + TAPT, also achieves better results than the base models. All resources will be made available to improve low-resource NLP tasks generally, as well as similar-domain tasks such as hate speech and sentiment classification.
Problem

Research questions and friction points this paper is trying to address.

Adapting PLMs to African-language social media text
Improving emotion classification for low-resource African languages
Enhancing NLP tasks with domain- and task-adaptive pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts PLMs to African-language social media text
Uses domain-adaptive pretraining (DAPT) on the AfriSocial corpus
Combines DAPT with task-adaptive pretraining (TAPT)
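Both DAPT and TAPT continue pretraining with the same masked-language-model (MLM) objective XLM-R was originally trained with, only on different unlabeled text (in-domain AfriSocial for DAPT, task-similar data for TAPT). As an illustration rather than the paper's released code, here is a minimal sketch of the standard BERT/XLM-R-style 80/10/10 token-masking step, assuming generic integer token IDs and a hypothetical `mask_id`:

```python
import random

def mlm_mask(token_ids, vocab_size, mask_id, mask_prob=0.15, seed=0):
    """Standard MLM masking used during continued pretraining:
    select ~15% of positions; of those, 80% become [MASK], 10% become
    a random token, 10% are kept unchanged. Labels are -100 (ignored
    by the loss) everywhere except at the selected positions."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok              # loss is computed at this position
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id      # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```

In practice this masking is what a standard MLM trainer (e.g. a language-modeling data collator) applies on the fly; the DAPT+TAPT recipe amounts to running such a trainer first over AfriSocial and then over the unlabeled task data before fine-tuning on the labeled emotion classification set.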