When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies significant cultural bias in multimodal large language models (MLLMs) in mixed-cultural scenarios: models over-rely on the visual features of the depicted person, which hurts their robustness in recognizing cultural entities such as kimchi, with accuracy swings of up to 58%. To study this, the paper introduces MixCuBe, a cross-cultural bias benchmark covering cultural elements from five countries and four ethnic groups, together with an evaluation protocol that pairs controlled image perturbations (swapping the ethnicity of the person shown with an entity) with systematic prompt-based testing across models. Results show that mainstream MLLMs perform well and remain stable on high-resource cultures but are highly sensitive to such perturbations on low-resource ones; GPT-4o, the strongest model overall, still shows up to a 58% accuracy gap between the original and perturbed settings in low-resource cultures. The study quantifies this cultural robustness gap and provides a benchmark and analysis protocol for fairer multimodal modeling.
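As a rough illustration of the protocol described above, the sketch below measures the accuracy gap between original and ethnicity-perturbed images for a single model. `query_mllm`, the record fields (`original`, `perturbed`, `label`), and the prompt are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of the perturbation-gap evaluation; the model call and
# data layout are placeholders, not the benchmark's real API.
from collections import defaultdict

def query_mllm(image, prompt: str) -> str:
    """Placeholder for a call to the MLLM under test (e.g., GPT-4o)."""
    raise NotImplementedError

def accuracy_gap(records) -> float:
    """Accuracy difference between original and ethnicity-perturbed images.

    Each record is assumed to hold an `original` image, a `perturbed`
    image (same cultural entity, different person ethnicity), and the
    gold entity name under `label` (e.g., "kimchi").
    """
    correct = defaultdict(int)
    prompt = "What food is shown in this image? Answer with its name."
    for rec in records:
        for setting in ("original", "perturbed"):
            answer = query_mllm(rec[setting], prompt)
            # Loose string match between the gold label and the answer.
            correct[setting] += int(rec["label"].lower() in answer.lower())
    n = len(records)
    return correct["original"] / n - correct["perturbed"] / n
```

A gap near zero would indicate robustness to the person's ethnicity; the summary above reports gaps of up to 58% for low-resource cultures.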

📝 Abstract
In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (a Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: https://huggingface.co/datasets/kyawyethu/MixCuBe.
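Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the standard `datasets` library. The snippet below is a minimal sketch assuming the default configuration; the split names and column layout are not specified here.

```python
# Minimal sketch for browsing the released benchmark data.
from datasets import load_dataset

ds = load_dataset("kyawyethu/MixCuBe")

splits = list(ds)
print(splits)            # available splits (names assumed, not documented here)
print(ds[splits[0]][0])  # peek at the first example of the first split
```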
Problem

Research questions and friction points this paper is trying to address.

Evaluating cultural bias in multimodal models
Assessing model accuracy across ethnicities
Addressing misclassification in mixed-cultural contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing MixCuBe for cultural bias evaluation
Testing MLLMs with five countries and four ethnicities
Publicly releasing dataset for cross-cultural analysis