CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 25
Influential: 0
🤖 AI Summary
Existing cultural knowledge benchmarks for language models suffer from uneven geographical coverage, monolithic question design, and inadequate handling of questions with multiple correct answers. Method: We introduce CulturalBench, a global, multicultural evaluation benchmark of 1,227 human-written questions, each verified by five independent annotators, spanning 45 regions (including underrepresented ones such as Bangladesh and Zimbabwe) and 17 cultural topics. It uses a dual-setup (Easy/Hard) evaluation paradigm to systematically probe model sensitivity to question phrasing and convergence bias when several answers are valid. Contribution/Results: CulturalBench reveals pronounced LLM deficiencies in South American and Middle Eastern cultural knowledge. The best-performing model, GPT-4o, achieves 61.5% accuracy on the Hard setup (vs. 92.6% for humans), while Llama3-8b scores only 21.4%. GPT-4o outperforms other proprietary and open-source models on questions from every region except Oceania.

📝 Abstract
To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions, each verified by five independent annotators, span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups, CulturalBench-Easy and CulturalBench-Hard, which share the same questions but ask them differently. We find that LLMs are sensitive to this difference in setup (e.g., a 27.3% gap for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs: the best-performing model (GPT-4o) reaches only 61.5% and the worst (Llama3-8b) only 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., what utensils do the Chinese usually use?), revealing a tendency to converge on a single answer. Our results also indicate that OpenAI's GPT-4o substantially outperforms other proprietary and open-source models on questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
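
The two setups are concrete enough to sketch. Below is a minimal, hypothetical Python harness, assuming Easy presents each question as a 4-way multiple choice and Hard asks a separate True/False judgment for each question-option pair, scored all-or-nothing per question. The `ask_model` callable, the prompt templates, and the field names (`question`, `options`, `answer`, `option_labels`) are all illustrative assumptions, not the paper's released code or data schema.

```python
# Hypothetical evaluation sketch for the two CulturalBench setups.
# `ask_model` is any function that sends a prompt to an LLM and
# returns its raw text reply.

from typing import Callable, Dict, List

def eval_easy(questions: List[Dict], ask_model: Callable[[str], str]) -> float:
    """CulturalBench-Easy: standard 4-way multiple choice, one gold option."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {opt}" for letter, opt in zip("ABCD", q["options"])
        )
        # Count a hit if the reply begins with the gold letter, e.g. "B".
        if ask_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)

def eval_hard(questions: List[Dict], ask_model: Callable[[str], str]) -> float:
    """CulturalBench-Hard: each question-option pair is judged True/False
    independently; a question counts as correct only if all four binary
    judgments are right (an assumed all-or-nothing aggregation)."""
    correct = 0
    for q in questions:
        all_right = all(
            ask_model(
                f"{q['question']}\nIs the following answer correct: "
                f"{opt}? Reply True or False."
            ).strip().lower().startswith(str(gold).lower())
            for opt, gold in zip(q["options"], q["option_labels"])
        )
        correct += int(all_right)
    return correct / len(questions)
```

Under this all-or-nothing Hard scoring, a model that habitually commits to a single "best" option fails the multi-answer items, which matches the convergence tendency the paper reports.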
Problem

Research questions and friction points this paper is trying to address.

Assessing cultural knowledge in language models globally
Evaluating LM performance on underrepresented cultural regions
Identifying LM biases in handling multiple correct answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-AI teaming creates diverse cultural questions
Five annotators verify each question for accuracy
Benchmark challenges LMs with tricky multiple-answer questions