🤖 AI Summary
Current evaluation of multistep reasoning in language models is heavily biased toward high-resource languages such as English, and there is no systematic assessment of cross-lingual reasoning, especially for low-resource languages. To address this gap, we introduce the first manually translated, structurally aligned multistep reasoning benchmark for Bangla, covering both binary and non-binary questions to enable controlled cross-lingual analysis. We evaluate small multilingual models under English-centric and Bangla-centric settings. Results show that reasoning context yields greater gains on non-binary questions, yet all models make consistently weak use of Bangla reasoning steps, indicating a fundamental limitation in chain-of-thought comprehension and execution in low-resource settings. This work establishes the first dedicated evaluation framework for multistep reasoning in Bangla and reveals critical bottlenecks in current multilingual models' ability to reason over morphosyntactically rich, low-resource languages.
📝 Abstract
Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and on our translated version, comparing how well they exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context benefits the more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by examining how individual reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
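To make the evaluation design concrete, below is a minimal Python sketch of the controlled comparison the abstract describes: the same small multilingual model is prompted with and without the gold reasoning steps, and the accuracy gain from reasoning context is measured separately for binary and non-binary questions. The model checkpoint, the data schema (`question`, `reasoning_steps`, `answer`, `question_type`), and the prompt wording are all illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the with/without-reasoning-context comparison described in
# the abstract. All identifiers below are illustrative assumptions; the paper's
# actual prompts, models, and data schema may differ.
from transformers import pipeline

# Placeholder small multilingual model (an assumption, not the paper's choice).
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

# Tiny stand-in for a Reveal-style dataset: each item has a question, gold
# reasoning steps, a gold answer, and a binary/non-binary type label.
dataset = [
    {
        "question": "Is 17 a prime number?",
        "reasoning_steps": ["17 has no divisors other than 1 and 17."],
        "answer": "yes",
        "question_type": "binary",
    },
    {
        "question": "How many legs do three spiders have in total?",
        "reasoning_steps": ["Each spider has 8 legs.", "3 x 8 = 24."],
        "answer": "24",
        "question_type": "non-binary",
    },
]

def build_prompt(example, with_reasoning):
    """Answer-only prompt vs. a prompt that includes the gold reasoning steps."""
    parts = [f"Question: {example['question']}"]
    if with_reasoning:
        parts.append("Reasoning steps:\n" + "\n".join(example["reasoning_steps"]))
    parts.append("Answer:")
    return "\n".join(parts)

def accuracy(examples, with_reasoning):
    """Lenient exact-match accuracy under one prompting condition."""
    correct = 0
    for ex in examples:
        out = generator(
            build_prompt(ex, with_reasoning),
            max_new_tokens=16,
            return_full_text=False,
        )[0]["generated_text"]
        correct += ex["answer"].lower() in out.lower()
    return correct / len(examples)

# Gain from reasoning context, reported per question type. Running the same
# loop on the English originals and on the Bangla translations yields the
# cross-lingual comparison the paper reports.
for qtype in ("binary", "non-binary"):
    subset = [ex for ex in dataset if ex["question_type"] == qtype]
    gain = accuracy(subset, True) - accuracy(subset, False)
    print(f"{qtype}: gain from reasoning context = {gain:+.2f}")
```

Because the translated benchmark is structurally aligned with the English Reveal data, the same harness can be run on both versions unchanged, which is what makes the cross-lingual comparison controlled.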