🤖 AI Summary
This work addresses the challenging task of reasoning over culturally rich, metaphor-laden, low-resource Bengali traditional riddles. We introduce BanglaRiddleEval—the first fine-grained evaluation benchmark for this domain—comprising 1,244 riddles spanning four subtasks: generative question answering, multiple-choice reasoning, ambiguity identification, and semantic disambiguation. To overcome data scarcity, we propose a novel evaluation framework integrating LLM-based chain-of-thought generation, semantically adversarial distractor construction, and expert-curated ambiguity annotations. Evaluation employs multi-strategy prompting (zero-shot, few-shot, and CoT) and hybrid metrics combining ROUGE, BERTScore, and accuracy. Experimental results reveal fundamental limitations of current multilingual LLMs: the best-performing model achieves only 56% accuracy on multiple-choice reasoning (vs. 83% human performance), and ambiguity resolution rates range from 26% to 68%. While generated answers show moderate semantic overlap with references, their logical correctness remains markedly low—highlighting critical gaps in cultural commonsense and metaphorical reasoning capabilities.
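The hybrid metric scheme above pairs overlap-based scores with exact-match accuracy. As an illustrative sketch only (the benchmark presumably uses standard ROUGE and BERTScore packages), a pure-Python ROUGE-L F1 via longest common subsequence, combined with exact-match accuracy, might look like this:

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    # LCS length via dynamic programming
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def evaluate(refs: list[str], preds: list[str]) -> dict[str, float]:
    """Hybrid score: corpus-mean ROUGE-L F1 plus exact-match accuracy."""
    n = len(refs)
    rouge = sum(rouge_l_f1(r, p) for r, p in zip(refs, preds)) / n
    acc = sum(r.strip() == p.strip() for r, p in zip(refs, preds)) / n
    return {"rouge_l": rouge, "accuracy": acc}
```

This captures the gap the summary highlights: a prediction can score well on surface overlap (`rouge_l`) while still failing the stricter correctness check (`accuracy`).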
📝 Abstract
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.