Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) exhibit significant biases in cross-cultural figurative language understanding, particularly for non-Western, dialectally diverse cultural expressions such as Arabic proverbs, owing to limited cultural sensitivity and contextual adaptability. To address this, we introduce Jawaher, the first benchmark dataset spanning multiple Arabic dialects, with four-layer annotations: the original proverb, a dialect label, an idiomatic translation, and an expert-curated cultural explanation. Unlike prior work, Jawaher is the first to systematically combine multidialectal proverbs, idiomatic translation, and deep cultural contextualization, filling a critical gap in non-English figurative language evaluation. Comprehensive automated and human evaluations of both open- and closed-source LLMs show that while current models generate accurate literal translations, they consistently fail to produce culturally appropriate or contextually grounded interpretations, highlighting the core bottleneck in cross-cultural figurative language comprehension.
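The four-layer annotation scheme above can be illustrated as a simple record type. This is a minimal sketch only; the field names and the example values are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ProverbEntry:
    """One hypothetical Jawaher record with the four annotation layers."""
    proverb: str      # original Arabic proverb
    dialect: str      # dialect label, e.g. "Egyptian"
    translation: str  # idiomatic English translation
    explanation: str  # expert-curated cultural explanation

# Placeholder content; real entries would hold the annotated Arabic text.
entry = ProverbEntry(
    proverb="...",
    dialect="Egyptian",
    translation="...",
    explanation="...",
)
```

A flat record like this keeps the literal form, dialect identity, idiomatic meaning, and cultural context side by side, which is what allows translation quality and explanation quality to be scored separately.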

📝 Abstract
Recent advancements in instruction fine-tuning, alignment methods such as reinforcement learning from human feedback (RLHF), and optimization techniques like direct preference optimization (DPO) have significantly enhanced the adaptability of large language models (LLMs) to user preferences. However, despite these innovations, many LLMs continue to exhibit biases toward Western, Anglo-centric, or American cultures, with performance on English data consistently surpassing that of other languages. This reveals a persistent cultural gap in LLMs, which complicates their ability to accurately process culturally rich and diverse figurative language such as proverbs. To address this, we introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. Jawaher includes proverbs from various Arabic dialects, along with idiomatic translations and explanations. Through extensive evaluations of both open- and closed-source models, we find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations. These findings highlight the need for ongoing model refinement and dataset expansion to bridge the cultural gap in figurative language processing.
Problem

Research questions and friction points this paper is trying to address.

Addressing cultural bias in large language models (LLMs).
Evaluating LLMs' ability to interpret Arabic proverbs.
Bridging the cultural gap in figurative language processing.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Jawaher for Arabic proverb benchmarking.
Uses a multidialectal dataset for cultural diversity.
Evaluates LLMs on idiomatic translations and explanations.