🤖 AI Summary
This work addresses the lack of benchmark datasets and effective methods for offensive language identification in code-mixed social media text involving Tulu, a low-resource Dravidian language. We introduce the first Tulu–English code-mixed offensive language dataset, comprising 3,845 YouTube comments with fine-grained annotations, thereby filling a critical gap in NLP research for under-resourced South Indian languages. We propose a novel annotation schema tailored to code-mixed contexts and conduct a systematic evaluation of both conventional architectures (GRU, LSTM, CNN, and self-attention variants) and multilingual pre-trained models (mBERT, XLM-RoBERTa). Experimental results show that a BiGRU augmented with self-attention achieves the best performance (accuracy: 82%, macro-F1: 0.81), demonstrating the advantage of task-specific architectures in low-resource, code-mixed settings. Our dataset, annotation framework, and empirical findings provide a reproducible methodology and benchmark for similar low-resource, code-mixed language scenarios.
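The core of the winning model is attention pooling: the BiGRU's per-token hidden states are collapsed into one comment vector by a learned relevance weighting before classification. The paper does not publish its implementation, so the following is only a minimal NumPy sketch of that pooling step; the learned vector `w`, the sequence length, and the hidden size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(H, w):
    """Collapse BiGRU hidden states H (shape T x 2d, one row per token)
    into a single comment vector via self-attention weighting.
    `w` stands in for the learned attention parameter vector."""
    scores = H @ w           # (T,) unnormalized relevance of each time step
    alpha = softmax(scores)  # attention distribution over time steps
    return alpha @ H         # (2d,) attention-weighted sum of hidden states

# Toy example: 4 tokens, hidden size 2d = 6, random stand-in for `w`.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))
w = rng.normal(size=6)
comment_vec = attention_pool(H, w)
print(comment_vec.shape)  # (6,)
```

In the full model, the pooled vector would feed a softmax layer over the four offense classes; here the sketch stops at the pooling itself.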
📝 Abstract
Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the limitations of multilingual pretraining in code-mixed, under-resourced contexts. This work lays the foundation for further NLP research in Tulu and similar low-resource, code-mixed languages.
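The reported reliability figure (Krippendorff's alpha = 0.984) can be reproduced from raw label assignments with a short routine. The sketch below implements alpha for nominal labels with complete data (every item labeled by at least two coders); the example labels are hypothetical, not drawn from the dataset.

```python
from collections import Counter

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of per-item label lists (the labels assigned by
    each coder to that item; no missing values)."""
    coincidences = Counter()  # (c, k) -> coincidence-matrix entry
    totals = Counter()        # c -> marginal count n_c
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single-coder item contributes no label pairs
        counts = Counter(labels)
        for c, n_c in counts.items():
            totals[c] += n_c
            for k, n_k in counts.items():
                pairs = n_c * (n_k - 1) if c == k else n_c * n_k
                coincidences[(c, k)] += pairs / (m - 1)
    n = sum(totals.values())
    # Observed vs. expected disagreement (nominal metric: labels either match or not).
    d_o = sum(v for (c, k), v in coincidences.items() if c != k)
    d_e = sum(totals[c] * totals[k] for c in totals for k in totals if c != k) / (n - 1)
    return 1.0 - d_o / d_e if d_e else 1.0

# Hypothetical two-coder labels: perfect agreement yields alpha = 1.0.
print(krippendorff_alpha_nominal([["Not Offensive", "Not Offensive"],
                                  ["Offensive Targeted", "Offensive Targeted"]]))  # 1.0
```

An alpha near 0.98, as reported for this dataset, indicates near-perfect agreement; values at or below 0 indicate agreement no better than chance.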