🤖 AI Summary
Existing Bangla hate speech detection methods overlook the informal expressions and culturally embedded context of regional dialects, particularly those of Barishal, Noakhali, and Chittagong, which leads to weaker detection performance and biased content moderation. To address this, we introduce BIDWESH, the first multi-dialectal, multidimensional hate speech dataset for Bangla, comprising 9,183 manually annotated instances covering these three dialects. The dataset is built by translating instances from the BD-SHS corpus into each dialect and labeling them under a fine-grained, multi-label annotation scheme covering defamation, gender-based, religious, and incitement-to-violence categories, with human verification of linguistic accuracy and contextual coherence. BIDWESH is intended to make models more sensitive to non-standard lexical forms and culturally nuanced hateful content. It addresses a gap in low-resource dialectal NLP for fair and precise content moderation and provides a first benchmark for dialect-aware hate speech detection.
📝 Abstract
Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.
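To make the annotation scheme concrete, the sketch below shows one way a single entry could be represented in Python. It is illustrative only: the class and field names (`BidweshRecord`, `hate_types`, `multi_hot`) are hypothetical and are not the dataset's published schema, and treating the hate type as a set of labels is an assumption based on the summary's description of the scheme as multi-label.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional


class Dialect(Enum):
    BARISHAL = "barishal"
    NOAKHALI = "noakhali"
    CHITTAGONG = "chittagong"


class HateType(Enum):
    SLANDER = "slander"
    GENDER = "gender"
    RELIGION = "religion"
    CALL_TO_VIOLENCE = "call_to_violence"


class Target(Enum):
    INDIVIDUAL = "individual"
    MALE = "male"
    FEMALE = "female"
    GROUP = "group"


@dataclass(frozen=True)
class BidweshRecord:
    """Hypothetical container for one annotated instance (not the official schema)."""
    text: str                                   # dialect translation of a BD-SHS instance
    dialect: Dialect
    is_hate: bool                               # hate presence
    hate_types: FrozenSet[HateType] = frozenset()  # assumed multi-label
    target: Optional[Target] = None             # target group, if any

    def multi_hot(self) -> List[int]:
        """Encode hate types as a 0/1 vector in HateType declaration order."""
        return [int(t in self.hate_types) for t in HateType]


# Placeholder text; a real entry would hold the dialect sentence.
example = BidweshRecord(
    text="<dialect sentence>",
    dialect=Dialect.NOAKHALI,
    is_hate=True,
    hate_types=frozenset({HateType.SLANDER, HateType.RELIGION}),
    target=Target.GROUP,
)
```

A multi-hot vector like the one returned by `multi_hot()` is a common input format for multi-label classifiers, which is why the sketch includes it; the actual release may store labels differently.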