Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of annotated sarcasm detection datasets for low-resource African languages that account for both cultural context and semantic nuance. The authors present Yor-Sarc, the first gold-standard sarcasm detection dataset for Yorùbá, comprising 436 instances labelled by three native speakers. They propose a culturally informed multi-annotator protocol that incorporates context-sensitive interpretation and community-informed guidelines, and that preserves non-unanimous cases as soft labels to model annotation uncertainty. The agreement analysis shows substantial inter-annotator agreement: Fleiss' κ is 0.766, 83.3% of samples reach unanimous consensus, and pairwise Cohen's κ reaches up to 0.874, exceeding a number of benchmarks reported in English sarcasm studies. The protocol and agreement analysis are intended to support replication in other under-resourced African languages.

📝 Abstract
Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages, where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds, using an annotation protocol designed specifically for Yorùbá sarcasm with cultural context taken into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732–0.8743), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of benchmarks reported in English sarcasm research. The remaining 16.7% of cases, decided by majority agreement, are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
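The agreement statistics named in the abstract (Fleiss' κ across all three annotators, pairwise Cohen's κ, and soft labels for non-unanimous items) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the six-item toy ratings matrix are assumptions made up for illustration, with 1 standing for "sarcastic" and 0 for "not sarcastic".

```python
from collections import Counter
from itertools import combinations


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of annotators who gave item i the label c
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of annotator pairs that agree on the item
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_obs = sum(p_items) / n_items
    # Chance agreement from the overall label proportions
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_exp = sum(p ** 2 for p in p_cat)
    # Undefined (division by zero) if every rating uses a single label
    return (p_obs - p_exp) / (1 - p_exp)


def cohen_kappa(a, b, categories=(0, 1)):
    """Cohen's kappa between two annotators' label sequences."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)


def column(ratings, j):
    """Labels given by annotator j across all items."""
    return [item[j] for item in ratings]


# Hypothetical toy data: 6 items x 3 annotators (NOT taken from Yor-Sarc)
ratings = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

# Soft label = fraction of annotators who marked the item as sarcastic
soft_labels = [sum(item) / len(item) for item in ratings]
# Share of items on which all annotators agree
unanimous_share = sum(len(set(item)) == 1 for item in ratings) / len(ratings)

print("Fleiss' kappa:", round(fleiss_kappa(ratings), 4))
for i, j in combinations(range(3), 2):
    kappa = cohen_kappa(column(ratings, i), column(ratings, j))
    print(f"Cohen's kappa, annotators {i} and {j}:", round(kappa, 4))
print("Soft labels:", soft_labels)
print("Unanimous share:", unanimous_share)
```

With three annotators and binary labels, a non-unanimous item always has a soft label of 1/3 or 2/3, so a model trained on such targets sees the disagreement that a hard majority vote would discard.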
Problem

Research questions and friction points this paper is trying to address.

sarcasm detection
low-resource languages
annotated dataset
Yorùbá
computational semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

sarcasm detection
low-resource languages
Yorùbá
gold-standard dataset
culturally informed NLP
Toheeb Aduramomi Jimoh
Department of Computer Science and Information Systems, University of Limerick, Castletroy, V94 T9PX, Limerick, Ireland
Tabea De Wille
Department of Computer Science and Information Systems, University of Limerick, Castletroy, V94 T9PX, Limerick, Ireland
Nikola S. Nikolov
Associate Professor, Department of Computer Science and Information Systems, University of Limerick
Machine Learning, NLP, Graph Drawing