🤖 AI Summary
This work addresses the challenge of abusive language detection in low-resource Tigrinya social media. We introduce the first large-scale, human-annotated multitask benchmark dataset, comprising 13,717 YouTube comments with joint annotations for abusiveness, sentiment, and topic, and supporting both Romanized and Ge'ez script variants. Methodologically, we propose a low-resource-oriented multitask labeling framework, an iterative term-clustering strategy for data selection, and a cross-script modeling approach. Experimental results show that our lightweight multitask model substantially outperforms general-purpose large language models, achieving 86.0% accuracy on abusiveness detection, a 7.0 percentage-point improvement. We publicly release both the dataset and strong baseline models, establishing critical infrastructure and a methodological paradigm for online content safety research in low-resource languages.
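The cross-script aspect mentioned above can be pictured as a normalization step that maps Romanized Tigrinya tokens onto their Ge'ez-script forms before modeling. The sketch below is purely illustrative and assumes a dictionary-based mapping; the tiny lexicon and function names are hypothetical, not the authors' actual pipeline.

```python
# Hypothetical cross-script normalization sketch: map known Romanized
# Tigrinya tokens to Ge'ez script so both writing systems share one
# representation. The one-entry lexicon is illustrative only.
ROMAN_TO_GEEZ = {
    "selam": "ሰላም",  # "peace / hello"
}

def normalize_token(token: str) -> str:
    """Return the Ge'ez form of a Romanized token if known, else the token."""
    return ROMAN_TO_GEEZ.get(token.lower(), token)

def normalize_comment(text: str) -> str:
    """Normalize every whitespace-separated token in a comment."""
    return " ".join(normalize_token(t) for t in text.split())
```

In practice a learned transliteration model or a much larger lexicon would replace the dictionary, but the idea of collapsing the two scripts into a shared space is the same.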
📝 Abstract
Content moderation research has recently made significant advances, but it still fails to serve the majority of the world's languages due to a lack of resources, leaving millions of vulnerable users exposed to online hostility. This work presents a large-scale, human-annotated multi-task benchmark dataset for abusive language detection in Tigrinya social media, with joint annotations for three tasks: abusiveness, sentiment, and topic classification. The dataset comprises 13,717 YouTube comments annotated by nine native speakers, collected from 7,373 videos with over 1.2 billion combined views across 51 channels. We developed an iterative term-clustering approach for effective data selection. Because around 64% of Tigrinya social media content uses Romanized transliteration rather than the native Ge'ez script, the dataset accommodates both writing systems to reflect actual language use. We establish strong baselines across the benchmark tasks, while significant challenges remain for future contributions. Our experiments reveal that small, specialized multi-task models outperform current frontier models in this low-resource setting, achieving up to 86% accuracy (+7 points) on abusiveness detection. We make the resources publicly available to promote research on online safety.
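The iterative term-clustering data selection named in the abstract can be sketched as a seed-and-expand loop: start from a small seed lexicon, retrieve matching comments, mine frequent co-occurring terms as new lexicon candidates, and repeat. This is a minimal sketch under that assumption; the function, thresholds, and stopping rule are illustrative and do not reproduce the paper's actual procedure.

```python
# Illustrative seed-and-expand loop for lexicon-driven data selection.
# All parameters (rounds, top_k) and the matching rule are assumptions.
from collections import Counter

def iterative_term_clustering(comments, seed_terms, rounds=3, top_k=5):
    """Iteratively expand a seed lexicon and collect matching comments."""
    selected, terms = set(), set(seed_terms)
    for _ in range(rounds):
        # Retrieve comments containing any current lexicon term.
        hits = [c for c in comments if any(t in c.split() for t in terms)]
        selected.update(hits)
        # Mine frequent co-occurring words as new lexicon candidates.
        counts = Counter(w for c in hits for w in c.split() if w not in terms)
        new_terms = {w for w, _ in counts.most_common(top_k)}
        if not new_terms - terms:  # stop once the lexicon stabilizes
            break
        terms |= new_terms
    return sorted(selected), terms
```

In a real pipeline the candidate terms would also be clustered and vetted by annotators before expanding the lexicon, which keeps the retrieved pool from drifting off-topic.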