See It from My Perspective: How Language Affects Cultural Bias in Image Understanding

📅 2024-06-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study identifies a significant Western cultural bias in vision-language models (VLMs) during image understanding and traces its root to insufficient linguistic representation in the text-only pretraining phase. Method: We construct a cross-cultural image–text benchmark, conduct controlled multilingual inference experiments, and combine objective/subjective visual task evaluations with language-representation attribution analysis. Contribution/Results: Experiments show VLMs achieve on average 12.3% higher accuracy on Western cultural images. We provide the first empirical evidence that augmenting non-English pretraining alone raises VLMs' performance on East Asian cultural content to parity with Western performance; moreover, even when prompting in English, sufficient prior exposure to culturally aligned languages reduces bias by 37%. This work establishes a theoretical foundation and a scalable intervention framework for diagnosing and mitigating cultural bias in multimodal models.
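The headline numbers above reduce to a per-split accuracy comparison. Below is a minimal sketch of how such a cultural bias gap could be computed; the split labels, data layout, and function names are illustrative assumptions, not the paper's released code.

```python
from collections import defaultdict

def accuracy_by_split(predictions, references, splits):
    """Accuracy computed separately for each cultural split ('western', 'east_asian')."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ref, split in zip(predictions, references, splits):
        total[split] += 1
        correct[split] += int(pred == ref)
    return {s: correct[s] / total[s] for s in total}

def cultural_bias_gap(predictions, references, splits):
    """Western-minus-East-Asian accuracy; positive values indicate a Western bias."""
    acc = accuracy_by_split(predictions, references, splits)
    return acc["western"] - acc["east_asian"]
```

Reported reductions in bias (such as the 37% figure) can then be read as the relative shrinkage of this gap under different pretraining or prompting conditions.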

๐Ÿ“ Abstract
Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from East Asian cultures attend more to scene context. In this work, we characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity. We evaluate VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western split than on the East Asian split of each task. Through controlled experimentation, we trace one source of this bias in image understanding to the lack of diversity in language model construction. While inference in a language nearer to a culture can lead to reductions in bias, we show it is much more effective when that language was well-represented during text-only pre-training. Interestingly, this yields bias reductions even when prompting in English. Our work highlights the importance of richer representation of all languages in building equitable VLMs.
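A rough sketch of the controlled multilingual inference the abstract describes: the same image and question are posed to one VLM in English and in a language closer to the image's culture, and per-language accuracy is compared. The `vlm.generate` interface and the prompt templates here are hypothetical stand-ins, not the authors' actual setup.

```python
# English prompt and its Chinese translation; only the prompt language varies,
# so any accuracy difference is attributable to the inference language.
PROMPTS = {
    "en": "What is the main emotion conveyed by this image? Answer in one word.",
    "zh": "这张图片传达的主要情绪是什么？请用一个词回答。",  # same question, in Chinese
}

def evaluate_language(vlm, examples, lang):
    """Accuracy of `vlm` on (image, gold_label) pairs when prompted in `lang`."""
    correct = 0
    for image, gold in examples:
        answer = vlm.generate(image=image, prompt=PROMPTS[lang])
        correct += int(answer.strip().lower() == gold.strip().lower())
    return correct / len(examples)
```

Running this on both the Western and East Asian splits, under both prompt languages, separates the effect of inference language from that of pre-training language coverage, which is the comparison the abstract draws.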
Problem

Research questions and friction points this paper is trying to address.

Characterize the Western bias of vision-language models in image understanding.
Investigate the role language plays in this cultural disparity.
Highlight the need for richer representation of all languages in building equitable VLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates VLMs on subjective and objective tasks with culturally diverse images.
Traces the bias to a lack of linguistic diversity in language model construction.
Shows that pre-training on culturally aligned languages reduces the bias, even with English prompts.