PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian

📅 2025-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) show significant deficits in non-English cultural competence because their training data is English-centric, and Persian culture in particular remains under-assessed. Method: We introduce PerCul, the first narrative-style, multiple-choice benchmark for evaluating LLMs’ understanding of Persian culture, designed and annotated collaboratively by native Persian speakers to ensure cultural authenticity and to prevent translation from serving as a shortcut. PerCul integrates cultural modeling, multi-round expert annotation, and a comparative multi-model evaluation framework. Contribution/Results: PerCul provides the first systematic assessment of LLMs’ Persian cultural understanding and reveals substantial gaps: the best closed-source model trails the layperson baseline by 11.3% in accuracy, while the top open-weight model lags by 21.3%. This work establishes a rigorous, culturally grounded evaluation paradigm and fills a critical gap in non-English cultural capability assessment.

📝 Abstract

Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments show an 11.3% gap between the best closed-source model and the layperson baseline, and the gap widens to 21.3% for the best open-weight model. The dataset is available at: https://huggingface.co/datasets/teias-ai/percul
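To make the reported comparison concrete, here is a minimal Python sketch of how a gap between a model and a human baseline could be computed on a multiple-choice benchmark. It assumes the gap is a plain difference in accuracy, which the abstract does not state explicitly. The function names and the toy answer lists are hypothetical and are not taken from PerCul or its evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the chosen option index equals the gold index."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def gap_to_baseline(model_acc: float, baseline_acc: float) -> float:
    """Baseline accuracy minus model accuracy, expressed in percentage points."""
    return (baseline_acc - model_acc) * 100


# Toy data invented for illustration only; these are NOT PerCul items or results.
gold = [0, 2, 1, 3, 0, 1, 2, 3, 1, 0]   # gold option index per question
human = [0, 2, 1, 3, 0, 1, 2, 3, 1, 2]  # hypothetical layperson answers
model = [0, 2, 1, 0, 0, 1, 3, 3, 1, 2]  # hypothetical model answers

human_acc = accuracy(human, gold)
model_acc = accuracy(model, gold)
gap = gap_to_baseline(model_acc, human_acc)
print(f"human={human_acc:.2f} model={model_acc:.2f} gap={gap:.1f} points")
```

Reporting the gap in percentage points rather than as a relative change keeps the number comparable across models with different absolute accuracies.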

Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' cultural sensitivity in Persian

Address Western cultural bias in LLMs

Evaluate Persian cultural nuance in AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Story-based evaluation questions

Input from native Persian annotators

Cultural assessment of multilingual LLMs

🔎 Similar Papers

Kalahi: A handcrafted, grassroots cultural LLM evaluation suite for Filipino