🤖 AI Summary
This work identifies a critical stability deficiency in vision-language models (VLMs) when processing multicultural images: when multiple unrelated cultural cues co-occur, VLMs fail to interpret the cultural content correctly because their attention drifts away from semantically relevant features. To evaluate this vulnerability systematically, we introduce ConfusedTourist, the first adversarial benchmark for cultural confusion. It generates culturally mixed samples via image stacking and generative perturbations, and employs interpretability methods to trace how the attention mechanism degrades. Experiments across mainstream VLMs reveal a substantial accuracy drop (32.7% on average), exposing their inability to disentangle and reason over composite cultural signals. Our contribution is twofold: (1) a standardized, challenging benchmark for evaluating cultural robustness in VLMs; and (2) the first systematic diagnosis of failure mechanisms under cultural interference, providing both theoretical insights and actionable directions for improving cross-cultural generalization.
📝 Abstract
Although the cultural dimension has become a key aspect of evaluating Vision-Language Models (VLMs), their ability to remain stable across diverse cultural inputs is largely untested, despite being crucial to supporting diversity and multicultural societies. Existing evaluations typically rely on benchmarks featuring a single cultural concept per image, overlooking scenarios where multiple, potentially unrelated cultural cues coexist. To address this gap, we introduce ConfusedTourist, a novel cultural adversarial robustness suite designed to assess VLMs' stability against perturbed geographical cues. Our experiments reveal a critical vulnerability: accuracy drops sharply under simple image-stacking perturbations and degrades even further under the image-generation-based variant. Interpretability analyses show that these failures stem from systematic attention shifts toward distracting cues, diverting the model from its intended focus. These findings highlight a critical challenge: visual mixing of cultural concepts can substantially impair even state-of-the-art VLMs, underscoring the urgent need for more culturally robust multimodal understanding.
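The image-stacking perturbation mentioned above can be illustrated with a minimal sketch. The function name and image shapes below are hypothetical, not taken from the paper's released code; the idea is simply to place two culturally distinct photographs side by side so that unrelated cultural cues co-occur in one input.

```python
import numpy as np

def stack_images(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    """Hypothetical image-stacking perturbation: concatenate two images
    horizontally so that cues from unrelated cultures appear together.
    Images are cropped to the smaller height before stacking."""
    h = min(img_a.shape[0], img_b.shape[0])
    return np.concatenate([img_a[:h], img_b[:h]], axis=1)

# Toy stand-ins for two culturally distinct photographs (H, W, RGB).
temple = np.zeros((224, 224, 3), dtype=np.uint8)
cathedral = np.full((224, 224, 3), 255, dtype=np.uint8)

mixed = stack_images(temple, cathedral)
print(mixed.shape)  # (224, 448, 3)
```

The stacked image would then be fed to the VLM with the original question (e.g., asking which country the scene is from), and the accuracy drop measured against the unperturbed single-culture image.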