🤖 AI Summary
This study investigates the impact of cultural background on Persian toxic language detection, emphasizing the critical role of cultural similarity and dissimilarity in cross-lingual transfer. We systematically evaluate fine-tuning, zero-shot/few-shot prompting, cross-lingual pretraining (mBERT, XLM-R), and culture-aware corpus filtering on a Persian toxic speech dataset. Our experiments reveal, for the first time, that transfer performance correlates inversely with cultural distance between source and target languages: culturally proximate Arabic yields a 12.4% F1 improvement over culturally distant English. These findings demonstrate that cultural factors constitute a fundamental, previously underappreciated variable in NLP-based toxicity detection. The work establishes a culture-sensitive paradigm for cross-lingual toxicity modeling, advancing both theoretical understanding and practical deployment in linguistically and culturally diverse settings.
📝 Abstract
Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. Because toxic language detection remains under-explored in Persian, this work compares several methods for the task: fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. Especially compelling is the impact of cultural context on transfer learning for this task: we show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country. Warning: This paper contains examples of toxic language that may disturb some readers. These examples are included for the purpose of research on toxic language detection.