Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the cross-lingual pragmatic competence of large language models (LLMs) on culturally embedded metaphorical expressions—particularly idioms and proverbs—in Arabic (including Egyptian Arabic) versus English. We propose metaphorical language as a diagnostic probe for cultural reasoning ability, introduce *Kinayat*, the first Egyptian Arabic idiom understanding and pragmatic evaluation dataset, and design a multi-task assessment framework covering contextual comprehension, pragmatic generation, and connotative interpretation. Evaluating 22 open- and closed-source LLMs reveals: (1) proverb accuracy in Arabic is 4.29 percentage points lower than in English, dropping further by 10.28 points for Egyptian Arabic idioms; (2) pragmatic task performance lags behind comprehension tasks by 14.07 points, though contextual prompting improves scores by 10.66 points; and (3) connotative interpretation consistency reaches only 85.58% of human annotator agreement. The findings expose systematic deficits in LLMs’ cultural pragmatics, establishing a novel evaluation paradigm and benchmark resource for culturally adaptive language modeling.

📝 Abstract
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Treating figurative language as a proxy for cultural competence, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: average accuracy on Arabic proverbs is 4.29% lower than on English proverbs, and performance on Egyptian idioms is 10.28% lower than on Arabic proverbs. On the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms that have 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they struggle to use it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for evaluating both figurative understanding and pragmatic use.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to process culturally grounded figurative language across cultures
Assessing models' pragmatic usage of figurative expressions beyond literal understanding
Measuring performance gaps in cultural reasoning through contextual interpretation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating LLMs' cultural processing with figurative language
Designing tasks for pragmatic use and connotation interpretation
Releasing Kinayat dataset for Arabic idiom evaluation
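The evaluation framework above compares per-task accuracies and reports the gaps between them (e.g. the 14.07-point drop from understanding to pragmatic use). A minimal sketch of that gap computation, using entirely hypothetical per-item results rather than the paper's actual harness or data:

```python
# Sketch of the multi-task accuracy-gap analysis described above.
# The task names and result lists are hypothetical illustrations,
# not the paper's evaluation code or the Kinayat dataset.

def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)

# Hypothetical per-item outcomes for one model (True = correct).
tasks = {
    "understanding": [True, True, False, True, True],    # contextual comprehension
    "pragmatic_use": [True, False, False, True, False],  # use in a generated context
}

scores = {name: accuracy(res) for name, res in tasks.items()}

# Gap between comprehension and pragmatic use, in percentage points,
# mirroring the kind of drop the paper reports.
gap_pp = (scores["understanding"] - scores["pragmatic_use"]) * 100
print(f"understanding: {scores['understanding']:.2%}")
print(f"pragmatic use: {scores['pragmatic_use']:.2%}")
print(f"gap: {gap_pp:.2f} percentage points")
```

The same pattern extends to per-language comparisons (Arabic vs. English proverbs) by grouping results before calling `accuracy`.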