🤖 AI Summary
This work addresses the lack of benchmark datasets and effective methods for offensive language identification in code-mixed social media text involving Tulu, a low-resource Dravidian language. We introduce the first Tulu–English code-mixed offensive language dataset, comprising 3,845 YouTube comments with fine-grained annotations, thereby filling a critical gap in NLP research for under-resourced South Indian languages. We propose a novel annotation schema tailored to code-mixed contexts and conduct a systematic evaluation of both conventional architectures (GRU, LSTM, CNN, and self-attention variants) and multilingual pre-trained models (mBERT, XLM-RoBERTa). Experimental results show that a BiGRU augmented with self-attention achieves the best performance (accuracy: 82%, macro-F1: 0.81), demonstrating the advantage of task-specific architectures in low-resource, code-mixed settings. Our dataset, annotation framework, and empirical findings provide a reproducible methodology and benchmark for similar low-resource, code-mixed language scenarios.
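The core of the winning model is attention pooling: the BiGRU's per-token hidden states are collapsed into one comment vector by a learned relevance weighting before classification. The paper does not publish its implementation, so the following is only a minimal NumPy sketch of that pooling step; the learned vector `w`, the sequence length, and the hidden size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(H, w):
    """Collapse BiGRU hidden states H (shape T x 2d, one row per token)
    into a single comment vector via self-attention weighting.
    `w` stands in for the learned attention parameter vector."""
    scores = H @ w           # (T,) unnormalized relevance of each time step
    alpha = softmax(scores)  # attention distribution over time steps
    return alpha @ H         # (2d,) attention-weighted sum of hidden states

# Toy example: 4 tokens, hidden size 2d = 6, random stand-in for `w`.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
w = rng.normal(size=6)
comment_vec = attention_pool(H, w)
print(comment_vec.shape)  # (6,)
```

In the full model, the pooled vector would feed a softmax layer over the four offense classes; here the sketch stops at the pooling itself.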
📝 Abstract
Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
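The reported reliability figure (Krippendorff's alpha = 0.984) can be reproduced from raw label assignments with a short routine. The sketch below implements alpha for nominal labels with complete data (every item labeled by at least two coders); the example labels are hypothetical, not drawn from the dataset.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of per-item label lists (the labels assigned by
    each coder to that item; no missing values)."""
    coincidences = Counter()  # (c, k) -> coincidence-matrix entry
    totals = Counter()        # c -> marginal count n_c
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single-coder item contributes no label pairs
        counts = Counter(labels)
        for c, n_c in counts.items():
            totals[c] += n_c
            for k, n_k in counts.items():
                pairs = n_c * (n_k - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / (m - 1)
    n = sum(totals.values())
    # Observed vs. expected disagreement (nominal metric: labels either match or not).
    d_o = sum(v for (c, k), v in coincidences.items() if c != k)
    d_e = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n - 1)
    return 1.0 - d_o / d_e if d_e else 1.0

# Hypothetical two-coder labels: perfect agreement yields alpha = 1.0.
print(krippendorff_alpha_nominal([["Not Offensive", "Not Offensive"],
                                  ["Offensive Targeted", "Offensive Targeted"]]))  # 1.0
```

An alpha near 0.98, as reported for this dataset, indicates near-perfect agreement; values at or below 0 indicate agreement no better than chance.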