🤖 AI Summary
This study addresses the susceptibility of large vision-language models (LVLMs) to cultural cues in images, such as those signaling religion, nationality, or socioeconomic status, which can lead to biased moral and value judgments that lack sensitivity to cross-cultural differences. To evaluate this systematically, the authors use counterfactual image sets that depict the same person across different cultural contexts, and they propose an assessment framework that combines Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to the depicted context. Applying this framework to five popular LVLMs, the work reports a consistent overreliance on cultural symbols, with outputs that deviate from cross-cultural fairness. These findings point to a gap in value alignment in current LVLMs and to the need for more culturally aware and equitable model design.
📝 Abstract
The rapid adoption of large vision-language models (LVLMs) in recent years has been accompanied by growing fairness concerns due to their propensity to reinforce harmful societal stereotypes. While significant attention has been paid to such fairness concerns in the context of social biases, relatively little prior work has examined the presence of stereotypes in LVLMs related to cultural contexts such as religion, nationality, and socioeconomic status. In this work, we aim to narrow this gap by investigating how cultural contexts depicted in images influence the judgments LVLMs make about a person's moral, ethical, and political values. We conduct a multi-dimensional analysis of such value judgments in five popular LVLMs using counterfactual image sets, which depict the same person across different cultural contexts. Our evaluation framework diagnoses LVLM awareness of cultural value differences through the use of Moral Foundations Theory, lexical analyses, and the sensitivity of generated values to depicted cultural contexts.
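To make "sensitivity of generated values to depicted cultural contexts" concrete, here is a minimal sketch of one way such a score could be computed. It is an illustrative assumption, not the metric used in the paper: the function names, the context labels, and the choice of mean pairwise Jaccard distance are all hypothetical. The idea is that, for one counterfactual image set (the same person in several contexts), the value words a model generates are compared across contexts. A score of 0 means the outputs did not change with context, and a score of 1 means they shared nothing.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def context_sensitivity(values_by_context):
    """Mean pairwise Jaccard distance between value-word sets.

    `values_by_context` maps a (hypothetical) context label to the value
    words a model generated for the same person shown in that context.
    """
    word_sets = [set(words) for words in values_by_context.values()]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Made-up example data; labels and words are placeholders, not paper results.
example = {
    "context_a": ["loyalty", "tradition", "family"],
    "context_b": ["loyalty", "tradition", "family"],
    "context_c": ["freedom", "fairness", "family"],
}
score = context_sensitivity(example)
print(f"sensitivity = {score:.3f}")
```

A set-based distance is used here only because it is simple and needs no external libraries. An embedding-based similarity, or a comparison of scores per moral foundation, would be equally plausible stand-ins.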