BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset

📅 2025-07-21

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing Bangla hate speech detection methods overlook informal expressions and culturally embedded context in regional dialects, particularly those of Barishal, Noakhali, and Chittagong, which limits detection capability and biases content moderation. To address this, the study introduces BIDWESH, the first multi-dialectal hate speech dataset for Bangla, comprising 9,183 instances from the BD-SHS corpus translated into these three dialects. Each entry is manually verified and annotated for hate presence, hate type (slander, gender, religion, call to violence), and target group (individual, male, female, group) to ensure linguistic and contextual accuracy. BIDWESH is intended to make models more sensitive to non-standard lexical forms and culturally nuanced hateful content, and it provides a benchmark resource for dialect-aware hate speech detection and more equitable content moderation in low-resource settings.


📝 Abstract

Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.
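
The annotation scheme described in the abstract (hate presence, hate type, target group, across three dialects) can be pictured as a small record type. The following is a minimal sketch only: the class and field names are hypothetical, the released dataset's actual column names and file format are not stated here, and two modelling choices are assumptions (an instance may carry more than one hate type, and non-hate instances carry no type or target).

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Dialect(Enum):
    # The three regional dialects named in the abstract.
    BARISHAL = "barishal"
    NOAKHALI = "noakhali"
    CHITTAGONG = "chittagong"


class HateType(Enum):
    # Hate categories listed in the abstract.
    SLANDER = "slander"
    GENDER = "gender"
    RELIGION = "religion"
    CALL_TO_VIOLENCE = "call_to_violence"


class Target(Enum):
    # Target groups listed in the abstract.
    INDIVIDUAL = "individual"
    MALE = "male"
    FEMALE = "female"
    GROUP = "group"


@dataclass(frozen=True)
class Instance:
    """Hypothetical record layout; not the dataset's published schema."""
    text: str
    dialect: Dialect
    is_hate: bool
    hate_types: FrozenSet[HateType] = frozenset()  # assumption: multi-label
    target: Optional[Target] = None

    def __post_init__(self) -> None:
        if not self.text.strip():
            raise ValueError("text must be non-empty")
        # Assumption of this sketch: only hateful instances have a type or target.
        if not self.is_hate and (self.hate_types or self.target is not None):
            raise ValueError("non-hate instances must not carry a type or target")


# Usage with a placeholder string standing in for a dialect sentence.
example = Instance(
    text="<dialect sentence>",
    dialect=Dialect.NOAKHALI,
    is_hate=True,
    hate_types=frozenset({HateType.GENDER}),
    target=Target.FEMALE,
)
```

Encoding the label sets as enums makes an out-of-vocabulary label fail at load time rather than silently entering a training set.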

Problem

Research questions and friction points this paper is trying to address.

Detecting hate speech in regional Bangla dialects

Closing the gap in datasets covering informal dialectal expressions

Improving content moderation for low-resource languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

First multi-dialectal Bangla hate speech dataset

Manually verified and annotated 9,183 dialect instances

Enables the development of dialect-sensitive NLP tools
