WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) show significant limitations in non-English contexts and in understanding marginalized cultural knowledge. Method: We introduce the largest multilingual, multicultural visual question answering benchmark to date, covering 30 languages and dialects across 9 language families with over 1 million image-text pairs, centered on dish-name recognition and geographic-provenance inference. The benchmark provides fine-grained regional annotations, cross-cultural image labeling, location-aware prompting, and adversarial-context evaluation. Contribution/Results: This work presents the first systematic assessment of VLMs' cross-lingual and cross-regional comprehension of global culinary culture. Results show that VLMs perform better when given correct location context, but they remain weak in adversarial settings and at predicting specific regional cuisines and languages. We publicly release the benchmark, including 12k and 60k evaluation splits and a 1M-instance training set, alongside an annotated culinary knowledge base and an adversarial test suite, establishing a new standard for culturally equitable VLM evaluation.

📝 Abstract
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs on culture-specific multilingual knowledge
Assesses VLM performance in underrepresented cultural contexts
Challenges VLMs with regional cuisines and language predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual VQA dataset
30 languages and dialects
1 million data points
👥 Authors
Genta Indra Winata
Capital One AI Foundations
Multilinguality, Language Modeling, Multimodal, Low-resource NLP, Code-Switching
Frederikus Hudi
Nara Institute of Science and Technology
Machine Translation, Multilinguality, Low-Resource NLP
Patrick Amadeus Irawan
MBZUAI, SMU
Natural Language Processing, Vision Language, Multimodality, Interpretability
David Anugraha
Stanford University
Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence
Rifki Afina Putri
SEACrowd, MBZUAI
Yutong Wang
JAIST
Adam Nohejl
Unknown affiliation
Natural Language Processing, Computational Psycholinguistics, Lexical Simplification
Ubaidillah Ariq Prathama
ITB
N. Ousidhoum
Cardiff University
Afifa Amriani
Independent
Anar Y. Rzayev
Independent
Anirban Das
Assistant Professor, BITS Pilani
Internet of Things, Cyber Physical Systems, Sentiment Analysis
Ashmari Pramodya
NAIST
Aulia Adila
JAIST
Bryan Wilie
Ph.D. Candidate, Hong Kong University of Science & Technology
Reasoning, Multilingualism, Retrieval-Augmented Generation, Agentic AI
Candy Olivia Mawalim
JAIST
Ching Lam Cheng
SMU
D. Abolade
Masakhane, University of Lagos
Emmanuele Chersoni
Hong Kong Polytechnic University
Computational Linguistics
Enrico Santus
Bloomberg, CTO AI
Natural Language Processing for Finance
Fariz Ikhwantri
Independent
Garry Kuwanto
Boston University
Hanyang Zhao
Columbia University
Haryo Akbarianto Wibowo
MBZUAI
Natural Language Processing
Holy Lovenia
SEACrowd
Multimodal & multilingual
Jan Christian Blaise Cruz
MBZUAI, McGill University, Mila - Quebec AI Institute
Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Jan Wira Gotama Putra
Independent
Junho Myung
KAIST
NLP, HCI
Lucky Susanto
Monash University Indonesia
Natural Language Processing, Machine Learning, Neural Machine Translation, Low-Resource Settings
Maria Angelica Riera Machin
NAIST
Marina Zhukova
UCSB
Michael Anugraha
Independent
Muhammad Farid Adilazuarda
SEACrowd, MBZUAI
Natasha Santosa
Tokyo Tech
Peerat Limkonchotiwat
Research Fellow, AI Singapore, National University of Singapore
Evaluation and Benchmark, Representation Learning, Large Language Model, Multilingual Learning
Raj Dabre
Researcher@NICT (Japan), Adjunct Faculty@IIT Madras/AI4Bharat (India)
Artificial Intelligence, Machine Translation, Natural Language Processing, Genetics
Rio Alexander Audino
ITB
Samuel Cahyawijaya
Cohere
Low-Resource NLP, Underrepresented Languages, Multilingual, Cross-lingual, Zero/Few-shot Learning
Shi-Xiong Zhang
Capital One
Stephanie Yulia Salim
JAIST
Yi Zhou
Cardiff University
Yinxuan Gui
SMU
D. Adelani
Masakhane, McGill, MILA
En-Shiun Annie Lee
Ontario Tech University, and University of Toronto (Status-Only)
Natural Language Processing, Data Mining, Pattern Analysis
Shogo Okada
JAIST
Ayu Purwarianti
Associate Professor, Informatics, Institut Teknologi Bandung, Indonesia
Computational Linguistics, Machine Learning
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Taro Watanabe
Nara Institute of Science and Technology
Machine Translation, Machine Learning
Derry Tanti Wijaya
Boston University, Monash University Indonesia
Natural Language Processing, Machine Learning, Information Extraction
Alice Oh
KAIST Computer Science
Machine Learning, NLP, Computational Social Science
Chong-Wah Ngo
Singapore Management University
Multimedia, Food Computing, Computer Vision, Information Retrieval