🤖 AI Summary
Current evaluation of multistep reasoning in language models is heavily biased toward high-resource languages such as English, and there is no systematic assessment of cross-lingual reasoning, especially for low-resource languages. To address this gap, we introduce the first manually translated, structurally aligned multistep reasoning benchmark for Bangla, covering both binary and non-binary questions to enable controlled cross-lingual analysis. We evaluate small multilingual models under English-centric and Bangla-centric settings. Results show that reasoning context yields greater gains on non-binary questions, yet all models make consistently weak use of Bangla reasoning steps, indicating a fundamental limitation in chain-of-thought comprehension and execution in low-resource settings. This work establishes the first dedicated evaluation framework for multistep reasoning in Bangla and reveals critical bottlenecks in current multilingual models' ability to reason over morphosyntactically rich, low-resource languages.
📝 Abstract
Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and on our translated version, comparing how well they exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context benefits the more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by examining how individual reasoning steps contribute to models' predictions, highlighting different trends across models and languages.
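To make the evaluation design concrete, below is a minimal Python sketch of the controlled comparison the abstract describes: the same small multilingual model is prompted with and without the gold reasoning steps, and the accuracy gain from reasoning context is measured separately for binary and non-binary questions. The model checkpoint, the data schema (`question`, `reasoning_steps`, `answer`, `question_type`), and the prompt wording are all illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the with/without-reasoning-context comparison described in
# the abstract. All identifiers below are illustrative assumptions; the paper's
# actual prompts, models, and data schema may differ.
from transformers import pipeline

# Placeholder small multilingual model (an assumption, not the paper's choice).
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

# Tiny stand-in for a Reveal-style dataset: each item has a question, gold
# reasoning steps, a gold answer, and a binary/non-binary type label.
dataset = [
    {
        "question": "Is 17 a prime number?",
        "reasoning_steps": ["17 has no divisors other than 1 and 17."],
        "answer": "yes",
        "question_type": "binary",
    },
    {
        "question": "How many legs do three spiders have in total?",
        "reasoning_steps": ["Each spider has 8 legs.", "3 x 8 = 24."],
        "answer": "24",
        "question_type": "non-binary",
    },
]

def build_prompt(example, with_reasoning):
    """Answer-only prompt vs. a prompt that includes the gold reasoning steps."""
    parts = [f"Question: {example['question']}"]
    if with_reasoning:
        parts.append("Reasoning steps:\n" + "\n".join(example["reasoning_steps"]))
    parts.append("Answer:")
    return "\n".join(parts)

def accuracy(examples, with_reasoning):
    """Lenient exact-match accuracy under one prompting condition."""
    correct = 0
    for ex in examples:
        out = generator(
            build_prompt(ex, with_reasoning),
            max_new_tokens=16,
            return_full_text=False,
        )[0]["generated_text"]
        correct += ex["answer"].lower() in out.lower()
    return correct / len(examples)

# Gain from reasoning context, reported per question type. Running the same
# loop on the English originals and on the Bangla translations yields the
# cross-lingual comparison the paper reports.
for qtype in ("binary", "non-binary"):
    subset = [ex for ex in dataset if ex["question_type"] == qtype]
    gain = accuracy(subset, True) - accuracy(subset, False)
    print(f"{qtype}: gain from reasoning context = {gain:+.2f}")
```

Because the translated benchmark is structurally aligned with the English Reveal data, the same harness can be run on both versions unchanged, which is what makes the cross-lingual comparison controlled.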