BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture

📅 2025-11-05
🤖 AI Summary
Existing ethical benchmarks are predominantly English-centric and grounded in Western moral frameworks, neglecting cultural diversity — in particular, there is no moral reasoning evaluation for Bengali, the world's sixth most spoken language (285+ million speakers), within its own sociocultural context. Method: We introduce the first large-scale, culturally grounded moral evaluation benchmark for Bengali, spanning five domains and fifty culturally sensitive subtopics. We propose the first multidimensional, localization-aware evaluation framework integrating virtue ethics, commonsense reasoning, and justice-based perspectives, validated via native-speaker consensus annotation. Evaluation follows a zero-shot multilingual protocol across Llama, Gemma, Qwen, and DeepSeek models. Contribution/Results: Model accuracy varies widely, from 50% to 91%, revealing critical deficiencies in cultural understanding, commonsense moral inference, and fairness-aware reasoning — underscoring both the necessity and the complexity of localized ethical alignment.

📝 Abstract
As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms remains underexplored, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and its socio-cultural contexts. It covers five moral domains (Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities), subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct a systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.
Problem

Research questions and friction points this paper is trying to address.

Evaluating moral reasoning alignment in Bengali LLMs
Addressing cultural bias in existing ethics benchmarks
Assessing multilingual models' ethical performance in low-resource contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed first Bengali ethics benchmark using native cultural contexts
Evaluated multilingual LLMs with unified zero-shot prompting protocol
Identified cultural grounding gaps via virtue-commonsense-justice frameworks
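The unified zero-shot protocol above can be sketched as a simple evaluation loop: format each annotated scenario into a fixed prompt, query the model once (no in-context examples), and score accuracy against the consensus label. This is an illustrative sketch only — `PROMPT_TEMPLATE`, the field names, and the toy model below are assumptions, not the paper's actual code or prompts.

```python
# Hedged sketch of a zero-shot moral-judgment evaluation loop.
# All names here (PROMPT_TEMPLATE, dataset fields, toy_model) are
# hypothetical stand-ins for whatever the benchmark actually uses.

PROMPT_TEMPLATE = (
    "Scenario (Bengali): {scenario}\n"
    "Under {lens} ethics, is the described action morally acceptable? "
    "Answer 'yes' or 'no'."
)

def evaluate(model, dataset):
    """Return zero-shot accuracy of `model` over annotated scenarios."""
    correct = 0
    for item in dataset:
        prompt = PROMPT_TEMPLATE.format(
            scenario=item["scenario"], lens=item["lens"]
        )
        prediction = model(prompt).strip().lower()  # single call, no exemplars
        correct += prediction == item["label"]
    return correct / len(dataset)

# Toy data and model, just to show the call shape.
dataset = [
    {"scenario": "...", "lens": "virtue", "label": "yes"},
    {"scenario": "...", "lens": "justice", "label": "no"},
]
toy_model = lambda prompt: "yes"  # degenerate model: always answers 'yes'

print(evaluate(toy_model, dataset))  # 0.5 on this toy set
```

In practice the `model` callable would wrap an inference API for each evaluated LLM (Llama, Gemma, Qwen, DeepSeek), with the same prompt applied across models so accuracy differences reflect the models rather than the prompting.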
Shahriyar Zaman Ridoy
Computational Intelligence and Operations Laboratory, Cohere Labs Community, North South University
Azmine Toushik Wasi
Shahjalal University of Science and Technology
Machine Learning · AI Agents & Reasoning · Health Informatics · Graph Neural Networks · HCI-HAI & Safety
Koushik Ahamed Tonmoy
North South University