When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies significant cultural bias in multimodal large language models (MLLMs) in mixed-cultural scenarios: models over-rely on the visual features of the depicted person, which hurts their robustness in recognizing cultural entities such as kimchi, with accuracy swings of up to 58%. To study this, the paper introduces MixCuBe, a cross-cultural bias benchmark covering cultural elements from five countries and four ethnic groups, together with an evaluation protocol that pairs controlled image perturbations (swapping the ethnicity of the person shown with an entity) with systematic prompt-based testing across models. Results show that mainstream MLLMs perform well and remain stable on high-resource cultures but are highly sensitive to such perturbations on low-resource ones; GPT-4o, the strongest model overall, still shows up to a 58% accuracy gap between the original and perturbed settings in low-resource cultures. The study quantifies this cultural robustness gap and provides a benchmark and analysis protocol for fairer multimodal modeling.
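As a rough illustration of the protocol described above, the sketch below measures the accuracy gap between original and ethnicity-perturbed images for a single model. `query_mllm`, the record fields (`original`, `perturbed`, `label`), and the prompt are illustrative assumptions, not the paper's actual interface.

```python
# Hedged sketch of the perturbation-gap evaluation; the model call and
# data layout are placeholders, not the benchmark's real API.
from collections import defaultdict

def query_mllm(image, prompt: str) -> str:
    """Placeholder for a call to the MLLM under test (e.g., GPT-4o)."""
    raise NotImplementedError

def accuracy_gap(records) -> float:
    """Accuracy difference between original and ethnicity-perturbed images.

    Each record is assumed to hold an `original` image, a `perturbed`
    image (same cultural entity, different person ethnicity), and the
    gold entity name under `label` (e.g., "kimchi").
    """
    correct = defaultdict(int)
    prompt = "What food is shown in this image? Answer with its name."
    for rec in records:
        for setting in ("original", "perturbed"):
            answer = query_mllm(rec[setting], prompt)
            # Loose string match between the gold label and the answer.
            correct[setting] += int(rec["label"].lower() in answer.lower())
    n = len(records)
    return correct["original"] / n - correct["perturbed"] / n
```

A gap near zero would indicate robustness to the person's ethnicity; the summary above reports gaps of up to 58% for low-resource cultures.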

📝 Abstract
In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (a Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures. Our dataset is publicly available at: https://huggingface.co/datasets/kyawyethu/MixCuBe.
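Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the standard `datasets` library. The snippet below is a minimal sketch assuming the default configuration; the split names and column layout are not specified here.

```python
# Minimal sketch for browsing the released benchmark data.
from datasets import load_dataset

ds = load_dataset("kyawyethu/MixCuBe")

splits = list(ds)
print(splits)            # available splits (names assumed, not documented here)
print(ds[splits[0]][0])  # peek at the first example of the first split
```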
Problem

Research questions and friction points this paper is trying to address.

Evaluating cultural bias in multimodal models
Assessing model accuracy across ethnicities
Addressing misclassification in mixed-cultural contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing MixCuBe for cultural bias evaluation
Testing MLLMs with five countries and four ethnicities
Publicly releasing dataset for cross-cultural analysis