Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory

📅 2025-05-28
🤖 AI Summary
Existing vision-language models (VLMs) often fail cultural competence evaluations, primarily because they lack structured modeling of the cultural nuances in images. This position paper proposes an interdisciplinary analytical framework that integrates cultural studies, semiotics, and visual theory to systematically identify and annotate five core cultural dimensions: power relations, identity representation, historical context, spatial politics, and ritual practice. Departing from purely data-driven approaches, the framework uses theory-guided conceptual mapping and a dimension-aware image annotation paradigm to enable interpretable diagnostic assessment of how well VLMs represent culture. The resulting methodology addresses the limitations of conventional benchmarks by giving cultural competence evaluation a theoretical foundation. It also offers practical pathways for bias tracing, cultural auditing, and the design of inclusive VLMs, supporting both methodological rigor and sociotechnical accountability in multimodal AI.

📝 Abstract
Modern vision-language models (VLMs) often fail at cultural competency evaluations and benchmarks. Given the diversity of applications built upon VLMs, there is renewed interest in understanding how they encode cultural nuances. While individual aspects of this problem have been studied, we still lack a comprehensive framework for systematically identifying and annotating the nuanced cultural dimensions present in images for VLMs. This position paper argues that foundational methodologies from visual culture studies (cultural studies, semiotics, and visual studies) are necessary for cultural analysis of images. Building upon this review, we propose a set of five frameworks, corresponding to cultural dimensions, that must be considered for a more complete analysis of the cultural competencies of VLMs.
Problem

Research questions and friction points this paper is trying to address.

VLMs lack cultural competency in evaluations and benchmarks
No comprehensive framework exists for analyzing cultural dimensions in VLMs
Cultural theory methodologies are needed for VLM image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilize visual culture studies methodologies
Propose five cultural dimension frameworks
Systematically analyze VLM cultural competencies
Authors
Srishti Yadav
Department of Computer Science, University of Copenhagen, Denmark; Pioneer Centre for AI, Denmark
Lauren Tilton
Department of Rhetoric and Communication Studies, University of Richmond, U.S.A.
Maria Antoniak
Pioneer Centre for AI, University of Copenhagen
Natural Language Processing, Cultural Analytics
Taylor Arnold
Department of Data Science and Statistics, University of Richmond, U.S.A.
Jiaang Li
University of Copenhagen
Computer Vision, Multimodality, Natural Language Processing
Siddhesh Pawar
Google
NLP, ML
Antonia Karamolegkou
PhD student, University of Copenhagen
Natural Language Processing, Machine Learning
Stella Frank
Department of Computer Science, University of Copenhagen, Denmark; Pioneer Centre for AI, Denmark
Zhaochong An
Department of Computer Science, University of Copenhagen, Denmark; Pioneer Centre for AI, Denmark
Negar Rostamzadeh
Research Scientist at Google Research
Machine Learning, Computer Vision, Responsible AI
Daniel Hershcovich
University of Copenhagen
Natural Language Processing
Serge Belongie
University of Copenhagen
Computer Vision, Machine Learning
E. Shutova
ILLC, University of Amsterdam, Netherlands