ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

📅 2025-02-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses low-resource named entity recognition (NER) for three regional Bangla dialects (Sylheti, Chittagonian, and Barisali) by introducing a dialect-balanced benchmark dataset. It comprises 10,443 sentences, 3,481 per region, annotated in BIO format with 10 entity types: Person, Location, Organization, Food, Animal, Colour, Role, Relation, Object, and Miscellaneous. Labeling was carried out by professional annotators with expertise in the regional dialects. The data was drawn from two publicly available datasets and from web-scraped online newspapers and articles, and it is organized into one subset per region in CSV format. Key contributions are: (1) a NER benchmark covering three Bangla regional dialects; (2) entity classes beyond the usual person, location, and organization, such as Food, Animal, and Colour; and (3) a per-region, CSV-formatted release. The dataset is intended for training and evaluating dialectal NER models and for supporting downstream low-resource NLP applications such as machine translation, information retrieval, and conversational AI.

📝 Abstract

ANCHOLIK-NER is a linguistically diverse dataset for Named Entity Recognition (NER) in Bangla regional dialects, capturing variations across Sylhet, Chittagong, and Barishal. The dataset has 10,443 sentences in total, 3,481 per region. The data was collected from two publicly available datasets and through web scraping of online newspapers and articles. To ensure high-quality annotations, the BIO tagging scheme was employed, and professional annotators with expertise in regional dialects carried out the labeling process. The dataset is structured into separate subsets for each region and is available in CSV format. Each entry contains textual data along with identified named entities and their corresponding annotations. Named entities are categorized into ten distinct classes: Person, Location, Organization, Food, Animal, Colour, Role, Relation, Object, and Miscellaneous. This dataset serves as a resource for developing and evaluating NER models for Bangla dialectal variations, contributing to regional language processing and low-resource NLP applications. It can be used to enhance NER systems in Bangla dialects, improve regional language understanding, and support applications in machine translation, information retrieval, and conversational AI.
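To make the BIO scheme mentioned in the abstract concrete, here is a minimal Python sketch of how BIO tags can be grouped back into entity spans. The tag spellings (`B-Person`, `I-Person`, `O`), the function name `bio_to_spans`, and the English placeholder tokens are assumptions for illustration only; the dataset's actual label strings and column layout may differ.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_class) pairs.

    Assumes tags look like "B-<Class>", "I-<Class>", or "O" (hypothetical
    spelling; adjust to the label strings used in the actual files).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_class: Optional[str] = None

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if current_tokens and current_class is not None:
            spans.append((" ".join(current_tokens), current_class))

    for token, tag in zip(tokens, tags):
        prefix, _, cls = tag.partition("-")
        if prefix == "B" or (prefix == "I" and cls != current_class):
            # "B-" opens a new entity; a stray "I-" with a different
            # class is treated as the start of a new entity as well.
            flush()
            current_tokens, current_class = [token], cls
        elif prefix == "I":
            # Continuation of the entity currently being collected.
            current_tokens.append(token)
        else:
            # "O" (outside) closes any open entity.
            flush()
            current_tokens, current_class = [], None

    flush()
    return spans


# Placeholder English tokens, not taken from the dataset.
example_tokens = ["Rahim", "Uddin", "visited", "Sylhet", "yesterday"]
example_tags = ["B-Person", "I-Person", "O", "B-Location", "O"]
print(bio_to_spans(example_tokens, example_tags))
# [('Rahim Uddin', 'Person'), ('Sylhet', 'Location')]
```

Treating a stray `I-` tag as the start of a new entity is one common leniency choice; a stricter reader could raise an error instead.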

Problem

Research questions and friction points this paper is trying to address.

Benchmark dataset for Bangla NER.

Captures regional dialect variations.

Supports low-resource NLP applications.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bangla regional NER dataset

BIO tagging for annotation

Ten distinct entity classes
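As a rough illustration of how BIO tagging and the ten entity classes combine, the sketch below builds a tag inventory in Python. Under a standard BIO scheme, ten classes give 2 × 10 + 1 = 21 tags (a `B-` and an `I-` tag per class, plus `O`). The tag spellings and the names `build_bio_labels`, `LABELS`, and `LABEL_TO_ID` are assumptions, not the dataset's documented format.

```python
from typing import Dict, List

# The ten classes named in the abstract.
ENTITY_CLASSES: List[str] = [
    "Person", "Location", "Organization", "Food", "Animal",
    "Colour", "Role", "Relation", "Object", "Miscellaneous",
]


def build_bio_labels(classes: List[str]) -> List[str]:
    """Return "O" followed by a B-/I- tag pair for each class."""
    labels = ["O"]
    for cls in classes:
        labels.extend([f"B-{cls}", f"I-{cls}"])
    return labels


LABELS = build_bio_labels(ENTITY_CLASSES)

# Integer ids of the kind a token-classification model typically expects.
LABEL_TO_ID: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}
ID_TO_LABEL: Dict[int, str] = {i: label for label, i in LABEL_TO_ID.items()}

print(len(LABELS))  # 21
```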
