BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

📅 2025-11-11
🤖 AI Summary
Existing Bangla Natural Language Inference (NLI) datasets suffer from annotation errors, semantic ambiguity, and insufficient linguistic diversity, hindering robust inference research for low-resource languages. To address these limitations, we introduce BNLI—the first linguistically curated, high-quality Bangla NLI dataset—constructed via a hybrid annotation pipeline integrating multi-stage human verification and fine-grained semantic analysis. This process ensures label consistency, balanced class distribution across entailment, neutral, and contradiction relations, and syntactically and semantically unambiguous premises and hypotheses. BNLI comprehensively covers regional dialectal variants and authentic, context-rich utterances, substantially improving linguistic diversity. Benchmarking on multilingual BERT and Bangla-specific language models demonstrates consistent performance gains over prior datasets (+4.2% average accuracy) and improved model interpretability. BNLI thus establishes a reliable evaluation benchmark and a reproducible construction methodology for low-resource NLI.

📝 Abstract
Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
Problem

Research questions and friction points this paper is trying to address.

Limited Bengali NLI datasets with inconsistencies and errors
Inadequate linguistic diversity hindering model training and evaluation
Need for reliable semantic inference in low-resource Bengali language
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined Bengali NLI dataset with linguistic curation
Rigorous annotation pipeline for semantic clarity and balance
Benchmarked using transformer-based multilingual and Bengali-specific models
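
The class-balance goal mentioned above can be checked with a few lines of code. The following is a minimal sketch, not the authors' pipeline: the label strings, the `(premise, hypothesis, label)` tuple layout, the 5% tolerance, and the helper names `label_distribution` and `is_balanced` are all assumptions made for illustration, and the toy pairs are English placeholders rather than BNLI data.

```python
from collections import Counter

# Assumed label names; the dataset's actual label strings may differ.
LABELS = ("entailment", "neutral", "contradiction")


def label_distribution(examples):
    """Return the share of each label in a list of (premise, hypothesis, label) tuples."""
    counts = Counter(label for _, _, label in examples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("examples must not be empty")
    return {label: counts.get(label, 0) / total for label in LABELS}


def is_balanced(distribution, tolerance=0.05):
    """True when every label share is within `tolerance` of a uniform split."""
    target = 1 / len(distribution)
    return all(abs(share - target) <= tolerance for share in distribution.values())


# Toy placeholder pairs (hypothetical, not taken from BNLI).
toy = [
    ("A man is sleeping.", "A person is asleep.", "entailment"),
    ("A man is sleeping.", "The man is dreaming.", "neutral"),
    ("A man is sleeping.", "The man is running.", "contradiction"),
]

dist = label_distribution(toy)
print(dist, is_balanced(dist))
```

A check like this only confirms that the three labels occur in similar proportions; it says nothing about annotation quality or semantic clarity, which the paper addresses through human verification.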
Farah Binta Haque
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Md Yasin
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Shishir Saha
Computer Science and Engineering, BRAC University, Dhaka, Bangladesh
Md Shoaib Akhter Rafi
Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
Farig Sadeque
Associate Professor, BRAC University
Natural Language Processing, Computational Social Science