BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

📅 2025-11-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Bengali Natural Language Inference (NLI) datasets suffer from annotation errors, semantic ambiguity, and insufficient linguistic diversity, which hinders robust inference research for this low-resource language. To address these limitations, we introduce BNLI, the first linguistically curated, high-quality Bengali NLI dataset, constructed via a hybrid annotation pipeline that integrates multi-stage human verification and fine-grained semantic analysis. This process ensures label consistency, a balanced class distribution across the entailment, neutral, and contradiction relations, and syntactically and semantically unambiguous premises and hypotheses. BNLI covers regional dialectal variants and authentic, context-rich utterances, substantially improving linguistic diversity. Benchmarking on multilingual BERT and Bengali-specific language models demonstrates consistent performance gains over prior datasets (+4.2% average accuracy) and improved model interpretability. BNLI thus establishes a reliable evaluation benchmark and a reproducible construction methodology for low-resource NLI.

📝 Abstract

Despite growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across the entailment, contradiction, and neutral classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
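
The balance across the three classes mentioned above can be checked with a few lines of code. The following is a minimal sketch, not the authors' pipeline: the record layout (`premise`, `hypothesis`, `label` keys), the label strings, and the function name `label_shares` are all assumptions made for illustration, and the toy records are placeholders rather than BNLI data.

```python
from collections import Counter

# Assumed label names; the actual BNLI release may encode classes differently.
LABELS = ("entailment", "contradiction", "neutral")


def label_shares(examples):
    """Return the fraction of examples carrying each NLI label.

    `examples` is an iterable of dicts with a "label" key (hypothetical layout).
    Raises ValueError on an empty input or on a label outside LABELS.
    """
    counts = Counter(ex["label"] for ex in examples)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no examples given")
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Placeholder records, not taken from the dataset.
    toy = [
        {"premise": "p1", "hypothesis": "h1", "label": "entailment"},
        {"premise": "p2", "hypothesis": "h2", "label": "contradiction"},
        {"premise": "p3", "hypothesis": "h3", "label": "neutral"},
    ]
    for label, share in label_shares(toy).items():
        print(f"{label}: {share:.2%}")
```

Shares close to one third each would indicate the kind of class balance the abstract describes; rejecting unknown labels also surfaces one simple form of annotation error.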

Problem

Research questions and friction points this paper is trying to address.

Limited Bengali NLI datasets with inconsistencies and errors

Inadequate linguistic diversity hindering model training and evaluation

Need for reliable semantic inference in Bengali, a low-resource language

Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined Bengali NLI dataset with linguistic curation

Rigorous annotation pipeline for semantic clarity and balance

Benchmarked using transformer-based multilingual and Bengali-specific models
