Can LLMs Solve My Grandma's Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging task of reasoning over culturally rich, metaphor-laden, low-resource Bengali traditional riddles. We introduce BanglaRiddleEval—the first fine-grained evaluation benchmark for this domain—comprising 1,244 riddles and covering four subtasks: generative question answering, multiple-choice reasoning, ambiguity identification, and semantic disambiguation. To overcome data scarcity, we propose a novel evaluation framework integrating LLM-based chain-of-thought generation, semantically adversarial distractor construction, and expert-curated ambiguity annotations. Evaluation employs multi-strategy prompting (zero-shot, few-shot, and CoT) and hybrid metrics combining ROUGE, BERTScore, and accuracy. Experimental results reveal fundamental limitations of current multilingual LLMs: the best-performing model achieves only 56% accuracy on multiple-choice reasoning (vs. 83% human performance), ambiguity resolution rates range from 26% to 68%, and while generated answers show moderate semantic overlap with references, their logical correctness remains markedly low—highlighting critical gaps in cultural commonsense and metaphorical reasoning capabilities.
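The hybrid metric described above combines lexical overlap (ROUGE), semantic similarity (BERTScore), and task accuracy. As an illustration of the lexical component only, here is a minimal pure-Python sketch of ROUGE-L F1 via longest common subsequence; the function names and token-level whitespace splitting are simplifying assumptions, not the paper's actual evaluation code (which would also need Bangla-aware tokenization).

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall.
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_len(ref, cand)
    precision, recall = lcs / len(cand), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In practice the paper's "moderate semantic overlap but low correctness" finding is exactly why such surface metrics are paired with BERTScore and human/LLM judgments: a fluent but wrong riddle answer can still share many words with the reference.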

📝 Abstract
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://github.com/Labib1610/BanglaRiddleEval.
Problem

Research questions and friction points this paper addresses.

Evaluates LLMs on reasoning with traditional Bangla riddles
Assesses figurative and cultural reasoning in low-resource settings
Introduces a benchmark to measure performance gaps versus humans
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based pipeline generates Chain-of-Thought explanations
Pipeline creates semantically coherent distractors and ambiguity annotations
Evaluates models with diverse prompting strategies on Bangla riddles
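The multi-strategy evaluation contrasts zero-shot, few-shot, and chain-of-thought prompting. A minimal sketch of how such prompt variants could be assembled is below; the template wording, function name, and example format are illustrative assumptions, not the paper's actual prompts.

```python
def build_prompt(riddle, strategy="zero_shot", examples=()):
    """Assemble a riddle-solving prompt under one of three strategies.

    strategy: "zero_shot", "few_shot", or "cot".
    examples: (riddle, answer) demonstration pairs for few-shot prompting.
    """
    parts = ["Solve the following traditional Bangla riddle."]
    if strategy == "few_shot":
        for q, a in examples:
            parts.append(f"Riddle: {q}\nAnswer: {a}")
    elif strategy == "cot":
        parts.append("Think step by step about the riddle's metaphor before answering.")
    parts.append(f"Riddle: {riddle}\nAnswer:")
    return "\n\n".join(parts)
```

Holding the riddle fixed while varying only `strategy` isolates the effect of the prompting method, which is what lets the benchmark compare, e.g., CoT against zero-shot on the same 1,244 items.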