ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a gap in the evaluation of cultural value understanding in multimodal large language models (MLLMs): existing evaluations have focused predominantly on textual inputs while neglecting visual contexts. To bridge this gap, the authors propose ValueGround, a benchmark for assessing culture-conditioned visual value grounding, constructed from World Values Survey (WVS) questions. It uses minimally contrastive image pairs that control for confounding variation, testing whether MLLMs can select the image aligned with a country's cultural value tendency without access to the original response-option texts. Experiments across six prominent MLLMs and 13 countries reveal a clear performance gap: although option-image alignment accuracy reaches 92.8%, average accuracy on the visual cultural reasoning task is only 65.8%, notably below the 72.8% achieved in the text-only setting, highlighting persistent challenges in cross-modal transfer of culture-conditioned value judgments.
📝 Abstract
Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.
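To make the task format concrete, here is a minimal sketch of the evaluation loop the abstract describes. The item schema, prompt wording, and `query_mllm` function are hypothetical placeholders for illustration, not the authors' released code or data format:

```python
# Minimal sketch of a ValueGround-style evaluation loop, assuming a simple
# item schema. All names below (Item, query_mllm, prompt text) are
# hypothetical, not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class Item:
    country: str     # e.g. "Japan"
    question: str    # WVS question text (response-option texts withheld)
    image_a: str     # path to image depicting one response option
    image_b: str     # path to its minimally contrastive counterpart
    gold: str        # "A" or "B": option matching the country's tendency

def query_mllm(prompt: str, images: list[str]) -> str:
    """Placeholder for a call to a multimodal LLM; returns 'A' or 'B'."""
    raise NotImplementedError

def evaluate(items: list[Item]) -> float:
    """Accuracy of culture-conditioned image selection over the benchmark."""
    correct = 0
    for item in items:
        prompt = (
            f"Consider this World Values Survey question for {item.country}:\n"
            f"{item.question}\n"
            "Which image, A or B, better matches the typical value tendency "
            "in this country? Answer with A or B."
        )
        pred = query_mllm(prompt, [item.image_a, item.image_b])
        correct += pred.strip().upper().startswith(item.gold)
    return correct / len(items)  # e.g. the paper reports 65.8% on average
```

Note that the model sees only the question and the two images, never the original response-option texts, which is what isolates visual value grounding from text matching.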
Problem

Research questions and friction points this paper is trying to address.

cultural values
visual grounding
multimodal large language models
cross-modal transfer
value judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual value grounding
multimodal large language models
culture-conditioned evaluation
minimally contrastive images
cross-modal transfer