BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of vision-language models (VLMs) lack systematic coverage of cultural understanding in cross-lingual and cross-modal settings. Method: We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to assess robustness in cultural understanding. Built on the BLEnD dataset, it constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: a text-only baseline (Region → Entity), an inverted text-only variant (Entity → Region), and a VQA-style version of the inverted variant with generated images, validated by human annotators for strict image-text semantic alignment. The benchmark comprises 4,916 images and over 21,000 questions. Contribution/Results: Experiments reveal substantial performance drops in state-of-the-art VLMs under linguistic rephrasing; while visual cues often help, low cross-modal consistency exposes weaknesses in integrating textual and visual understanding, particularly for lower-resource regions. The work provides a quantitative characterization of cross-modal cultural fragility, establishing a critical evaluation framework and actionable insights for developing trustworthy, culturally aware AI systems.

📝 Abstract
As vision-language models (VLMs) are deployed globally, their ability to understand culturally situated knowledge becomes essential. Yet, existing evaluations largely assess static recall or isolated visual grounding, leaving unanswered whether VLMs possess robust and transferable cultural understanding. We introduce BLEnD-Vis, a multimodal, multicultural benchmark designed to evaluate the robustness of everyday cultural knowledge in VLMs across linguistic rephrasings and visual modalities. Building on the BLEnD dataset, BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats: (i) a text-only baseline querying from Region $\to$ Entity, (ii) an inverted text-only variant (Entity $\to$ Region), and (iii) a VQA-style version of (ii) with generated images. The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation. BLEnD-Vis reveals significant fragility in current VLM cultural knowledge; models exhibit performance drops under linguistic rephrasing and, whilst visual cues often aid performance, low cross-modal consistency highlights challenges in robustly integrating textual and visual understanding, particularly for lower-resource regions. BLEnD-Vis thus provides a crucial testbed for systematically analysing cultural robustness and multimodal grounding, exposing limitations and guiding the development of more culturally competent VLMs.
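
To make the three aligned formats concrete, here is a minimal sketch of how a single BLEnD-style template could expand into the (i)/(ii)/(iii) variants described above. This is not the authors' pipeline; all names (`MCQ`, `build_variants`, the field names, and the example data) are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]             # answer choices, one of which is correct
    answer: str
    image_path: str | None = None  # set only for the VQA-style variant

def build_variants(template: str, region: str, entity: str,
                   entity_distractors: list[str],
                   region_distractors: list[str],
                   image_path: str) -> list[MCQ]:
    # (i) text-only baseline: Region -> Entity,
    # e.g. "What is a popular street food in South Korea?"
    region_to_entity = MCQ(
        question=template.format(region=region),
        options=sorted([entity, *entity_distractors]),
        answer=entity,
    )
    # (ii) inverted text-only variant: Entity -> Region
    entity_to_region = MCQ(
        question=(f"For which region is '{entity}' the typical answer to: "
                  f'"{template.format(region="this region")}"'),
        options=sorted([region, *region_distractors]),
        answer=region,
    )
    # (iii) VQA-style version of (ii): the entity is depicted in an image
    vqa = MCQ(
        question="In which region is the item shown in the image most typical?",
        options=entity_to_region.options,
        answer=region,
        image_path=image_path,
    )
    return [region_to_entity, entity_to_region, vqa]

# Hypothetical usage with made-up example data:
variants = build_variants(
    template="What is a popular street food in {region}?",
    region="South Korea", entity="tteokbokki",
    entity_distractors=["poutine", "currywurst", "arepa"],
    region_distractors=["Canada", "Germany", "Colombia"],
    image_path="images/tteokbokki.png",
)
```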
Problem

Research questions and friction points this paper is trying to address.

Evaluating cultural understanding robustness in vision-language models
Assessing multimodal knowledge transfer across linguistic rephrasings
Measuring cross-modal consistency in culturally grounded visual questions
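
One way to make the cross-modal consistency question above concrete is to measure the agreement rate between a model's answers on aligned text-only (Entity → Region) items and their VQA-style counterparts. The sketch below assumes that definition, which may differ from the paper's exact metric; `text_preds` and `vqa_preds` are hypothetical maps from a shared item id to the model's chosen option.

```python
def consistency(text_preds: dict[str, str], vqa_preds: dict[str, str]) -> float:
    """Fraction of aligned items where the model picks the same option
    in the text-only and VQA-style formats."""
    shared = text_preds.keys() & vqa_preds.keys()
    if not shared:
        raise ValueError("no aligned items to compare")
    agreements = sum(text_preds[i] == vqa_preds[i] for i in shared)
    return agreements / len(shared)

# A model that answers identically across modalities scores 1.0,
# regardless of whether those answers are correct.
assert consistency({"q1": "Greece", "q2": "Japan"},
                   {"q1": "Greece", "q2": "Mexico"}) == 0.5
```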
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multicultural benchmark evaluates cultural understanding robustness
Generates text-only and VQA-style multiple-choice question formats
Uses human-validated generated images and linguistic rephrasing variants
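
Since the reported fragility is concentrated in lower-resource regions, a per-region accuracy breakdown is the natural first analysis on this benchmark. Below is a hedged sketch of that aggregation; `records` is a hypothetical iterable of (region, is_correct) pairs from any MCQ evaluation harness, not the official BLEnD-Vis tooling.

```python
from collections import defaultdict

def per_region_accuracy(records):
    """Aggregate MCQ correctness into an accuracy score per region."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for region, is_correct in records:
        total[region] += 1
        correct[region] += int(is_correct)
    return {r: correct[r] / total[r] for r in total}

# Toy example; comparing such scores across regions surfaces
# the low-resource degradation the benchmark is designed to expose.
print(per_region_accuracy([
    ("South Korea", True), ("Ethiopia", False), ("South Korea", True),
]))  # {'South Korea': 1.0, 'Ethiopia': 0.0}
```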