Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

📅 2026-02-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of multilingual evaluation resources for assessing sociocultural biases in large language models within Latin American contexts. To this end, the authors present LatamQA, the first systematically constructed multilingual multiple-choice question-answering dataset focused on Spanish- and Portuguese-speaking Latin American countries, comprising 26,000 questions. The dataset integrates content from Wikipedia, Wikidata knowledge graphs, geographic and cultural information, and expert input from social scientists, and includes English translations. LatamQA fills a critical gap in evaluating non-English sociocultural biases, and empirical analysis using the dataset reveals that current models demonstrate significantly weaker understanding of Latin American cultural contexts compared to Iberian Spanish culture, with models consistently performing better on native-language questions than on their English translations.
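The geographic filtering step described above can be sketched as a SPARQL query against the public Wikidata endpoint. The identifiers used (P17 "country", Q298 "Chile") are real Wikidata IDs, but the query shape is an illustrative assumption, not the authors' actual pipeline:

```python
# Sketch of Wikidata-based geographic filtering (an assumption, not the
# authors' exact pipeline): select items whose country (P17) is a given
# Latin American country and that have a linked Wikipedia article.

def build_country_query(country_qid: str, lang: str = "es", limit: int = 100) -> str:
    """Build a SPARQL query for items located in `country_qid`
    that have a sitelink to the `lang` Wikipedia."""
    return f"""
    SELECT ?item ?itemLabel ?article WHERE {{
      ?item wdt:P17 wd:{country_qid} .          # P17 = country
      ?article schema:about ?item ;
               schema:isPartOf <https://{lang}.wikipedia.org/> .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
    }}
    LIMIT {limit}
    """

# Example: entities whose country is Chile (Q298), with Spanish Wikipedia articles.
query = build_country_query("Q298", lang="es")
# The query would be sent to https://query.wikidata.org/sparql with
# format=json; the network call is omitted to keep the sketch offline.
```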

📝 Abstract
Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources for detecting biases in non-English languages, especially from Latin America (Latam), a continent containing diverse cultures that nonetheless share common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science to create a dataset of question/answer (Q/A) pairs based on the popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers, extracted from 26k Wikipedia articles and transformed into multiple-choice questions (MCQs) in Spanish and Portuguese, in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance across Latam countries, some being easier than others for the majority of models, (ii) that models perform better in the original language, and (iii) that Iberian Spanish culture is better known than Latam culture.
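The per-country MCQ evaluation can be sketched as simple accuracy over gold choice indices. The field names and data layout below are assumptions for illustration, not the released LatamQA schema:

```python
from collections import defaultdict

def per_country_accuracy(examples, predict):
    """Compute MCQ accuracy per country.

    `examples` is an iterable of dicts with (assumed) fields:
    'country', 'question', 'choices' (list of strings), and 'answer'
    (index of the gold choice). `predict` maps (question, choices)
    to a predicted index, e.g. by prompting an LLM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        pred = predict(ex["question"], ex["choices"])
        total[ex["country"]] += 1
        if pred == ex["answer"]:
            correct[ex["country"]] += 1
    return {c: correct[c] / total[c] for c in total}

# Toy usage with a trivial predictor that always picks option 0.
data = [
    {"country": "Chile", "question": "q1", "choices": ["a", "b"], "answer": 0},
    {"country": "Chile", "question": "q2", "choices": ["a", "b"], "answer": 1},
    {"country": "Peru",  "question": "q3", "choices": ["a", "b"], "answer": 0},
]
acc = per_country_accuracy(data, lambda q, ch: 0)
# acc == {"Chile": 0.5, "Peru": 1.0}
```

Running the same loop once per language (original vs. English translation) yields the per-language comparison reported in the paper.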
Problem

Research questions and friction points this paper is trying to address.

sociocultural bias
Large Language Models
Latin America
non-English languages
cultural representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Wikidata
sociocultural bias
Latin America
multilingual QA dataset
LLM evaluation
Yannis Karmim
PhD at CNAM
Graph Neural Networks, Deep Learning, Dynamics
Renato Pino
Dept. of Computer Science, Universidad de Chile
Hernan Contreras
Institute of International Studies, Universidad de Chile
Hernan Lira
Inria Chile Research Center
Sebastian Cifuentes
Centro Nacional de Inteligencia Artificial
Simon Escoffier
School of Social Work, Pontificia Universidad Católica de Chile
Luis Martí
Inria
Machine learning, neural networks, evolutionary computation, multi-objective optimization
Djamé Seddah
Inria (Almanach)
LLMs, data set development, low-resource languages, Arabic dialects, UGC
Valentin Barrière
Dept. of Computer Science, Universidad de Chile