ParsCN: A Persian Dataset for Counter-Narrative Generation to Combat Online Hate Speech

📅 2026-03-27
🤖 AI Summary
This study addresses the scarcity of high-quality counter-narrative data for low-resource languages such as Persian, which hinders research on automated counter-narrative generation against online hate speech. To bridge this gap, the authors introduce ParsCN, the first comprehensive Persian counter-narrative dataset, comprising 1,100 annotated hate speech and counter-narrative pairs. They also propose a scalable, multi-stage framework that combines culturally informed human annotation, semantic retrieval, few-shot large language model generation (e.g., GPT-4o, Claude), and rigorous human validation to keep data construction cost-effective without sacrificing quality. In human evaluation, human-authored counter-narratives achieve the highest scores in relevance (4.23), effectiveness (4.21), fluency (4.92), and tone appropriateness (4.79), with GPT-4o and Claude following closely; automatic evaluation shows strong semantic alignment, high lexical diversity, and low toxicity across all sources. The authors present the framework as replicable for other low-resource languages, and their benchmark evaluations with mBART and PersianMind show that existing models still struggle with fluency, cultural nuance, and safety.
📝 Abstract
Online hate speech threatens online civility, particularly in low-resource and multilingual environments. Counter-narratives offer a promising solution by promoting constructive responses to hate speech. However, automatic counter-narrative generation is hindered by the lack of high-quality data for low-resource languages like Persian. To bridge this gap, we introduce ParsCN, the first and most comprehensive Persian counter-narrative dataset. Consisting of 1,100 hate speech and counter-narrative pairs, it provides fine-grained annotations across six target groups and six countering strategies, tailored to the socio-cultural context of Persian online discourse. We propose a novel, scalable multi-stage framework that integrates culturally informed human annotation with few-shot LLM-augmented generation, guided by semantic retrieval and rigorous manual curation. This approach enables the creation of diverse, high-quality counter-narratives while significantly reducing annotation costs, establishing a replicable paradigm for other low-resource settings. Comprehensive human and automatic evaluations confirm the quality of the dataset and the effectiveness of the generated responses. Human-written counter-narratives achieved the highest scores for relevance (4.23), effectiveness (4.21), fluency (4.92), and tone appropriateness (4.79), with GPT-4o and Claude closely following. Automatic evaluations show strong semantic alignment, high lexical diversity, and low toxicity across all sources. Finally, we conduct benchmark evaluations using mBART and PersianMind on a held-out test set. Results reveal that existing models struggle with fluency, cultural nuance, and safety, highlighting the need for Persian-specific resources like ParsCN. Our dataset serves as a foundational benchmark to advance research on Persian counter-narrative generation and foster safer, more inclusive digital spaces.
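The abstract describes few-shot LLM generation guided by semantic retrieval but does not give implementation details. The following is only a minimal sketch of how such a step might look, not the authors' pipeline. The function names (`embed`, `retrieve_examples`, `build_prompt`), the prompt wording, and the seed data are all hypothetical, and the bag-of-words `embed` is a toy stand-in for whatever sentence encoder a real system would use.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(query, seed_pairs, k=2):
    """Return the k human-written (message, counter-narrative) pairs closest to the query."""
    q = embed(query)
    ranked = sorted(seed_pairs, key=lambda p: cosine(q, embed(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt; the instruction wording is illustrative only."""
    parts = ["Write a respectful counter-narrative to the final message."]
    for message, counter in examples:
        parts.append(f"Message: {message}\nCounter-narrative: {counter}")
    parts.append(f"Message: {query}\nCounter-narrative:")
    return "\n\n".join(parts)
```

In a setup like this, the retrieved pairs would come from the human-annotated seed set, and the LLM output would still go through manual curation before entering the dataset, as the abstract describes.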
Problem

Research questions and friction points this paper is trying to address.

hate speech
counter-narrative generation
low-resource languages
Persian dataset
online civility
Innovation

Methods, ideas, or system contributions that make the work stand out.

counter-narrative generation
low-resource languages
culturally-informed annotation
few-shot LLM augmentation
semantic retrieval