On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work traces the origins of cultural biases in language models against non-Western entities, particularly Arab ones, to cultural imbalance in pre-training corpora and to tokenization's sensitivity to polysemy and script sharing. To support this analysis, the authors introduce CAMeL-2, a parallel Arabic–English benchmark of 58,086 cultural entities with 367 masked natural contexts, and combine masked language modeling evaluation, cross-lingual frequency and ambiguity analysis, and tokenization-attribution experiments. The findings show that frequency-driven tokenization amplifies representational bias for high-frequency polysemous Arabic entities and for non-Arabic entities written in the Arabic script; the effect worsens with larger Arabic vocabularies, while cross-cultural performance gaps narrow markedly in English. Together, the benchmark and attribution analysis provide a framework for advancing cross-lingual cultural fairness in language modeling.

📝 Abstract
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal that LMs show smaller performance gaps between cultures when tested in English than in Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, and how it worsens with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2
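The frequency-based tokenization effect described in the abstract can be illustrated with a toy byte-pair-encoding-style learner: merges are chosen greedily by pair frequency, so high-frequency strings collapse into single vocabulary tokens while rare strings stay fragmented. The sketch below is a hypothetical, self-contained illustration of that mechanism, not the tokenizers or data used in the paper; the corpus, function name, and merge budget are all invented for the example.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE-style merges from a toy whitespace-split corpus.
    The most frequent adjacent symbol pair is merged at each step,
    so frequent words become single tokens before rare ones do."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges, words

# A "frequent entity" dominates the merge order; the rare one stays split.
merges, vocab = bpe_merges("ab ab ab cd", num_merges=1)
```

With one merge, the frequent string `ab` becomes a single token `('ab',)` while the rare `cd` remains the two symbols `('c', 'd')`. The paper's argument is that this frequency-driven dynamic, applied to real multilingual corpora, yields whole-word tokens for high-frequency polysemous Arabic strings and thereby entangles their multiple senses.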
Problem

Research questions and friction points this paper is trying to address.

Cultural Bias
Multilingual Language Models
Arabic Script
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cultural Bias in Language Models
CAMeL-2 Test
Tokenization Strategy Impact