BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

📅 2024-06-14
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Existing evaluations inadequately assess large language models' (LLMs) understanding of everyday cultural commonsense, particularly in non-English and low-resource language settings. To address this gap, the authors introduce BLEnD, a hand-crafted multilingual benchmark for evaluating LLMs' knowledge of daily life across diverse cultures. BLEnD spans 16 countries/regions and 13 languages, including low-resource ones such as Amharic, and comprises 52.6K question-answer pairs about mundane everyday topics (e.g., birthday foods, musical instruments) in both short-answer and multiple-choice formats. The analysis shows that LLMs perform better for cultures that are highly represented online, with GPT-4, the best-performing model, exhibiting up to a 57.34% cross-cultural accuracy gap in the short-answer format. For cultures represented by low-resource languages, models answer more accurately in English than in the local language, whereas the reverse holds for mid-to-high-resource languages. The dataset is publicly released.

📝 Abstract
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.
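The abstract describes a short-answer format in which a model's prediction is checked against human-annotated answers. The sketch below illustrates one plausible way to score such predictions: a prediction counts as correct if it matches any annotator-provided answer after light normalization. The data format, `normalize` rule, and function names here are illustrative assumptions, not the authors' released evaluation code.

```python
# Illustrative sketch (assumed format, not the official BLEnD scorer):
# score short-answer predictions against sets of human-annotated answers.

def normalize(text: str) -> str:
    """Lowercase and keep only alphanumerics/spaces for lenient matching."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def short_answer_accuracy(predictions, gold_answer_sets):
    """Fraction of predictions matching any gold answer for their question."""
    correct = 0
    for pred, golds in zip(predictions, gold_answer_sets):
        if normalize(pred) in {normalize(g) for g in golds}:
            correct += 1
    return correct / len(predictions)

# Hypothetical example: two everyday-culture questions with multiple
# acceptable annotator answers each.
preds = ["Seaweed soup", "football"]
golds = [["miyeokguk", "seaweed soup"], ["soccer", "football"]]
print(short_answer_accuracy(preds, golds))  # 1.0
```

Accepting any annotator answer (rather than a single reference) matters for this kind of benchmark, since everyday questions like "What do people eat on their birthday?" legitimately have several correct answers per culture.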
Problem

Research questions and friction points this paper is trying to address.

Multicultural Common Sense
Language Models
Non-English Contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

BLEnD
Multilingual Capability
Cultural Adaptability
Junho Myung
KAIST
Nayeon Lee
School of Computing, KAIST
AI Ethics · Cross-cultural NLP · computational social science · NLP
Yi Zhou
Cardiff University
Jiho Jin
KAIST
NLP · Machine Learning
Rifki Afina Putri
KAIST
Dimosthenis Antypas
Cardiff University
Hsuvas Borkakoty
Cardiff University
Eunsu Kim
KAIST
AI · NLP
Carla Pérez-Almendros
Cardiff University
A. Ayele
Universität Hamburg, Bahir Dar University
Víctor Gutiérrez-Basulto
Cardiff University
Yazmín Ibáñez-García
Cardiff University
Hwaran Lee
NAVER AI Lab
Shamsuddeen Hassan Muhammad
Bayero University, Kano & Imperial College London (Google DeepMind Academic Fellow)
Natural Language Processing · Sentiment Analysis · Africa · NLP · Low-resource NLP · Multilinguality
Kiwoong Park
KAIST
A. Rzayev
KAIST
Nina White
Cardiff University
Seid Muhie Yimam
Universität Hamburg
Mohammad Taher Pilehvar
Cardiff University / TeIAS / Cambridge
Artificial Intelligence · Natural Language Processing · Lexical Semantics · Semantic Representation
N. Ousidhoum
Cardiff University
José Camacho-Collados
Cardiff University
Alice Oh
KAIST Computer Science
machine learning · NLP · computational social science