Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of benchmark datasets and effective methods for offensive language identification in code-mixed social media text involving Tulu—a low-resource Dravidian language. We introduce the first Tulu–English code-mixed offensive language dataset, comprising 3,845 YouTube comments with fine-grained annotations, thereby filling a critical gap in NLP research for under-resourced South Indian languages. We propose a novel annotation schema tailored to code-mixed contexts and conduct a systematic evaluation of both conventional architectures (GRU, LSTM, CNN, and self-attention) and multilingual pre-trained models (mBERT, XLM-RoBERTa). Experimental results show that a BiGRU augmented with self-attention achieves the best performance (accuracy: 82%, macro-F1: 0.81), demonstrating the advantage of task-specific architectures for low-resource code-mixed settings. Our dataset, annotation framework, and empirical findings provide a reproducible methodology and benchmark for similar low-resource, code-mixed language scenarios.
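The headline numbers (accuracy 82%, macro-F1 0.81) use the macro average, i.e. per-class F1 scores averaged with equal weight regardless of class size, which matters here because the four offense labels are unlikely to be balanced. A minimal sketch of the computation; the helper and the toy labels are illustrative, not taken from the paper:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy predictions over three of the dataset's label names (illustrative values only)
gold = ["NotOffensive", "NotOffensive", "NotTulu", "OffensiveTargeted"]
pred = ["NotOffensive", "NotTulu", "NotTulu", "OffensiveTargeted"]
print(round(macro_f1(gold, pred), 3))  # 0.778
```

Because every class contributes equally, a model that ignores the rare offensive classes is penalized even if its raw accuracy stays high.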

📝 Abstract
Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
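The reported inter-annotator agreement (Krippendorff's alpha = 0.984) is computed over nominal labels from a units-by-annotators table. A minimal pure-Python sketch of the nominal-alpha calculation; the toy ratings are illustrative, not the paper's annotation data:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """ratings: list of units, each a list of labels (None = missing).
    Returns Krippendorff's alpha with the nominal distance (0 if equal, 1 otherwise)."""
    # Coincidence matrix: every ordered label pair within a unit,
    # weighted by 1/(m - 1) where m is the number of ratings for that unit.
    coincidence = Counter()
    for unit in ratings:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two ratings contributes no pairs
        for a, b in permutations(values, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)
    marginals = Counter()
    for (a, _), w in coincidence.items():
        marginals[a] += w
    n = sum(marginals.values())
    observed = sum(w for (a, b), w in coincidence.items() if a != b) / n
    expected = sum(marginals[a] * marginals[b]
                   for a in marginals for b in marginals if a != b) / (n * (n - 1))
    return 1.0 - observed / expected

# Two annotators labelling four comments (toy data)
toy = [["NOT", "NOT"], ["NOT", "OFF"], ["OFF", "OFF"], ["OFF", "NOT"]]
print(round(krippendorff_alpha_nominal(toy), 3))  # 0.125
```

Unlike raw percent agreement, alpha corrects for chance agreement, so a value of 0.984 on a four-class scheme indicates near-perfect annotator consistency.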
Problem

Research questions and friction points this paper is trying to address.

Creating first offensive language dataset for Tulu
Evaluating deep learning models for Tulu OLI
Addressing low-resource challenges in code-mixed Tulu
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark dataset for Tulu OLI
BiGRU with self-attention achieves best performance
Evaluates deep learning and transformer models
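The winning configuration is described only at the architecture level: a bidirectional GRU whose per-token hidden states are pooled by self-attention before a four-way classification head. A minimal PyTorch sketch under assumed hyperparameters; the vocabulary size, embedding and hidden dimensions, and the additive-attention pooling are illustrative choices, not the paper's reported settings:

```python
import torch
import torch.nn as nn

class BiGRUSelfAttention(nn.Module):
    """BiGRU encoder + additive self-attention pooling + 4-class head.
    All sizes below are illustrative assumptions, not the paper's settings."""
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=64, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)       # scores each timestep
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        h, _ = self.bigru(self.embed(token_ids))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(torch.tanh(h)), dim=1)  # (batch, seq_len, 1)
        pooled = (weights * h).sum(dim=1)               # attention-weighted sum over time
        return self.classifier(pooled)                  # (batch, num_classes) logits

model = BiGRUSelfAttention()
logits = model(torch.randint(1, 20000, (8, 40)))        # 8 comments, 40 tokens each
print(logits.shape)  # torch.Size([8, 4])
```

Attention pooling lets the classifier weight the few tokens that signal offensiveness rather than relying on the final hidden state alone, which is one plausible reason such task-specific recurrent models outperformed the multilingual transformers here.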
Anusha M D
Department of Computer Science, Yenepoya (Deemed to be University), Balmata, Mangalore, 575002, Karnataka, India.
Deepthi Vikram
Department of Computer Science, Yenepoya (Deemed to be University), Balmata, Mangalore, 575002, Karnataka, India.
Bharathi Raja Chakravarthi
Assistant Professor / Lecturer-Above-the-Bar, School of Computer Science, University of Galway
Natural Language Processing, Under-resourced languages, Multimodal Machine Learning, Hate Speech
Parameshwar R Hegde
Department of Computer Science, Yenepoya (Deemed to be University), Balmata, Mangalore, 575002, Karnataka, India.