🤖 AI Summary
This study addresses the detection of abusive text targeting women, including hate speech, derogatory language, and threats, in Tamil and Malayalam social media, two low-resource Dravidian languages. We use logistic regression and BERT as base models, trained and evaluated on the annotated DravidianLangTech@2025 datasets for each language. The reported macro F1 scores are 0.729 for BERT on the Tamil test set and 0.6279 for logistic regression on the Malayalam test set. The work treats gendered abuse detection as a content-safety problem for South Indian languages and provides baseline results for gender-inclusive NLP in low-resource settings.
📝 Abstract
The increasing misuse of social media has become a concern, and technological solutions are being developed to moderate its content effectively. This paper focuses on detecting abusive text targeting women on social media platforms. Abusive speech refers to communication intended to harm or incite hatred against vulnerable individuals or groups. To this end, we used logistic regression and BERT as base models, trained on the DravidianLangTech@2025 datasets for Tamil and Malayalam. On the test datasets, the models achieved macro F1 scores of 0.729 (BERT, Tamil) and 0.6279 (logistic regression, Malayalam).
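For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1 score is computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight, so a rare class counts as much as a frequent one. The function name `macro_f1` and the label strings `"abusive"` and `"non-abusive"` are illustrative assumptions and are not taken from the paper or the shared-task data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the paper
y_true = ["abusive", "abusive", "non-abusive", "non-abusive", "non-abusive"]
y_pred = ["abusive", "non-abusive", "non-abusive", "non-abusive", "abusive"]
print(round(macro_f1(y_true, y_pred), 4))
```

In this toy case the per-class F1 scores are 0.5 for `"abusive"` and about 0.667 for `"non-abusive"`, so the macro average is about 0.583. Because each class contributes equally, the metric is sensitive to performance on the minority class, which is typically why it is preferred over accuracy for imbalanced abuse-detection data.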